Abstract

Tight bounds on the block entropy of patterns of sequences generated by independent and
identically distributed (i.i.d.) sources are derived. A pattern of a sequence is a sequence of integer
indices with each index representing the order of first occurrence of the respective symbol in the
original sequence. Since a pattern is the result of data processing on the original sequence, its
entropy cannot be larger. Bounds derived here describe the pattern entropy as a function of the
original i.i.d. source entropy, the alphabet size, the symbol probabilities, and their arrangement
in the probability space. Matching upper and lower bounds derived provide a useful tool for very
accurate approximations of pattern block entropies for various distributions, and for assessing
the decrease of the pattern entropy from that of the original i.i.d. sequence.

Index Terms: patterns, index sequences, entropy. 



1 Introduction 



Several recent works (see, e.g., [1], [6], [7], [12], [15], [16]) have considered universal compression for patterns of independent and identically distributed (i.i.d.) sequences. The pattern of a sequence $x^n \triangleq (x_1, x_2, \ldots, x_n)$ is a sequence $\psi^n \triangleq \psi(x^n) \triangleq \Psi(x^n)$ of pointers that point to the actual alphabet letters, where the alphabet letters are assigned indices in order of first occurrence. For example, the pattern of all sequences $x^n$ = lossless, $x^n$ = sellsoll, $x^n$ = 12331433, and $x^n$ = 76887288 is $\psi^n = \Psi(x^n) = 12331433$. Capital $\Psi(\cdot)$ is used to denote the operator of taking a pattern of a sequence. A pattern sequence thus contains all positive integers from 1 up to a maximum value in increasing order of first occurrence, and is also independent of the alphabet of the actual data.
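The pattern operator described above can be sketched in a few lines. This is only an illustration; the function name `pattern` is a hypothetical helper, not notation from the paper.

```python
def pattern(seq):
    """Return the pattern of seq: every symbol is replaced by the
    (1-based) order in which it first occurred."""
    index_of = {}  # symbol -> index assigned at its first occurrence
    out = []
    for symbol in seq:
        if symbol not in index_of:
            index_of[symbol] = len(index_of) + 1
        out.append(index_of[symbol])
    return out


if __name__ == "__main__":
    # The four sequences of the example share one pattern.
    for s in ("lossless", "sellsoll", "12331433", "76887288"):
        print(s, "".join(str(i) for i in pattern(s)))
```

All four example sequences map to the same index sequence, which illustrates that the pattern does not depend on the underlying alphabet.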

Universal compression of patterns is interesting when compressing sequences generated by an initially unknown alphabet, such as a document in an unknown language. In such applications, separate dictionary and pattern compression can be performed. Most initial work on this topic focused on showing diminishing universal compression redundancy rates for the individual sequence case [6], [3], and for the average case [12], [15], [16]. However, since a pattern $\Psi(x^n)$ is the result of data processing on the original sequence $x^n$, its entropy must be no greater than that of the original sequence. Specifically, if $x^n$ is generated by an i.i.d. source of alphabet size $k$,

$$nH_\theta(X) - \log\left[\frac{k!}{(\max\{0, k-n\})!}\right] \le H_\theta(\Psi^n) \le nH_\theta(X), \tag{1}$$

where capital letters denote random variables, and $\boldsymbol\theta$ is the probability parameter vector governing the source. The lower bound holds since $H_\theta(\Psi^n) = H_\theta(X^n, \Psi^n) - H_\theta(X^n \mid \Psi^n) = H_\theta(X^n) - H_\theta(X^n \mid \Psi^n)$, where the second equality is because there is no uncertainty about $\Psi^n$ given $X^n$. Finally, $H_\theta(X^n \mid \Psi^n)$ is bounded by the logarithm$^1$ of the total possible mappings from indices to symbols.
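As a rough numerical illustration of the simple bounds in (1), the sketch below evaluates both sides for a given probability vector. The function names are hypothetical, logarithms are base 2, and the factorial ratio is computed through `math.lgamma` to avoid overflow for large alphabets.

```python
import math


def iid_entropy_bits(theta):
    """Per-symbol entropy H(X) of an i.i.d. source, in bits."""
    return -sum(p * math.log2(p) for p in theta if p > 0)


def simple_pattern_entropy_bounds(theta, n):
    """Return (lower, upper) of the simple bounds (1) on H(Psi^n), in bits."""
    k = len(theta)
    block = n * iid_entropy_bits(theta)
    # log2( k! / (max{0, k - n})! ), computed with log-gamma
    log_mappings = (math.lgamma(k + 1)
                    - math.lgamma(max(0, k - n) + 1)) / math.log(2)
    return block - log_mappings, block


if __name__ == "__main__":
    print(simple_pattern_entropy_bounds([0.5, 0.5], 10))
```

For small alphabets the two sides differ by a constant, while for alphabets comparable to or larger than n the gap grows, which is the looseness the text goes on to discuss.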

The bounds in (1) already show that for $k = o(n)$, the pattern entropy rate equals the i.i.d. one for non-diminishing $H_\theta(X)$. However, the bounds in (1) are usually loose. Specifically, the description length shown for sufficiently large alphabets (see also [3]) for a universal sequential compression method for patterns was significantly smaller than the block i.i.d. entropy. This indicates that not only is there an entropy decrease in patterns, but for large alphabets, this decrease is more significant than universal coding redundancy. Hence, it is essential to study the behavior of the pattern entropy. Pattern entropy is also important in learning applications. Consider all the new species an explorer observes. The explorer can identify each species with the first time it was seen. It makes no difference whether the explorer sees species A or species B (and never sees the other). The next time an observed species is seen, it is identified with its index. The entropy of patterns can model the uncertainty of such processes. Its exponent gives an approximate count of the typical patterns one is likely to observe. If the uncertainty goes to 0, we are likely to observe only one pattern.

Initial results from this paper, first presented in [11], bounded the range of values within which the entropy of a pattern can be, depending on the alphabet size. Subsequently to our initial results [14], pattern entropy rates were independently studied with a different view of the problem in [5] and [8]. The main result was that for discrete i.i.d. sources the pattern entropy rate is equal to that of the underlying i.i.d. process. This result was also extended to discrete finite entropy stationary processes. Some limiting order of magnitude bounds on block pattern entropies were also provided.

$^1$ Logarithms are taken to base 2, here and elsewhere. The natural logarithm is denoted by ln.

This paper extensively studies block entropy of patterns, providing tight upper and matching lower bounds on the block entropy. The bound pairs can be used together to provide very accurate approximations of the pattern entropy. Specific distributions are studied in [13]. The basic method partitions the probability space into a grid of points. Between each two points, we obtain a bin. Symbols whose probabilities lie in the same bin can be exchanged in a given $x^n$ to provide another sequence $x'^n$ with the same pattern and almost equal probability. Counting all these sequences leads to the bounds on the pattern entropy. Very low probabilities are combined into one point mass. A key factor in obtaining tight bounds is a proper choice of increased-spacing grids.
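The exchange argument can be illustrated with a toy example: swapping two symbols everywhere in a sequence leaves the pattern unchanged, and when the two symbol probabilities are close, the sequence probability changes only slightly. The probabilities and helper names below are illustrative assumptions, not values from the paper.

```python
import math


def pattern(seq):
    """Indices in order of first occurrence (1-based)."""
    index_of = {}
    return [index_of.setdefault(s, len(index_of) + 1) for s in seq]


def swap_symbols(seq, a, b):
    """Exchange every occurrence of symbol a with b and vice versa."""
    return [b if s == a else a if s == b else s for s in seq]


def sequence_prob(seq, theta):
    """Probability of seq under an i.i.d. source with letter probabilities theta."""
    return math.prod(theta[s] for s in seq)


if __name__ == "__main__":
    theta = {"a": 0.30, "b": 0.31, "c": 0.39}   # 'a' and 'b' are close
    x = "abacabcc"
    y = swap_symbols(x, "a", "b")
    print(pattern(x) == pattern(y))              # same pattern
    print(sequence_prob(x, theta) / sequence_prob(y, theta))
```

Here the ratio of the two sequence probabilities is (0.30/0.31) raised to the difference between the occurrence counts of the two swapped symbols, so it stays near 1 when the probabilities fall in the same narrow bin.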

The outline of the paper is as follows. Section 2 defines some notation and presents some preliminaries. A summary of the main results in the paper is given in Section 3. Then, in Section 4, upper and lower bounds for pattern entropy of i.i.d. sources with sufficiently large probabilities are derived. Section 5 contains the derivations of more general upper and lower bounds, that do not require a condition on the letter probabilities. Finally, Section 6 shows the range of values that the pattern entropy can take for bounded probabilities, depending on the actual source distribution.

2 Preliminaries 

Let $x^n$ be an $n$-tuple with components $x_i \in \Sigma = \{1, 2, \ldots, k\}$ (where the alphabet is defined without loss of generality). The asymptotic regime is that $n \to \infty$, but $k$ may also be greater than $n$. The vector $\boldsymbol\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is the set of probabilities of all letters in $\Sigma$. Since the order of the probabilities does not affect the pattern, we assume, without loss of generality, that $\theta_1 \le \theta_2 \le \cdots \le \theta_k$. Boldface letters denote vectors, whose components are denoted by their indices.
Capital letters will denote random variables. The probability of $\psi^n$ induced by an i.i.d. source is

$$P_\theta(\psi^n) = \sum_{y^n:\,\Psi(y^n)=\psi^n} P_\theta(y^n). \tag{2}$$

This probability can also be expressed by fixing the actual sequence $x^n$ and summing over all permutations of occurring symbols of the parameter vector, i.e.,

$$P_\theta(\psi^n) = \sum_{\boldsymbol\sigma=\{\sigma_i,\; i\in x^n\}} P_{\theta(\boldsymbol\sigma)}(x^n), \tag{3}$$

where $\boldsymbol\sigma = \{\sigma_1, \ldots, \sigma_k\}$ is a permutation set. For example, if $\boldsymbol\theta = (0.4, 0.1, 0.2, 0.3)$ and $\boldsymbol\sigma = (3, 1, 4, 2)$, then $\boldsymbol\theta(\boldsymbol\sigma) = (0.2, 0.4, 0.3, 0.1)$ and $\theta(\sigma_2) = \theta_1 = 0.4$. The only relevant components of $\boldsymbol\sigma$ in (3) are those of occurring symbols. Thus if only $m < k$ symbols occur in $x^n$, there are only $k!/(k-m)!$ elements in the sum in (3). The entropy rate of an i.i.d. source is $H_\theta(X)$, and its sequence (block) entropy is $H_\theta(X^n) = nH_\theta(X)$. The pattern sequence entropy of order $n$ is
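The two ways of computing a pattern probability, summing over all sequences that share the pattern and summing over assignments of alphabet letters to the pattern indices, can be compared by brute force on a toy source. The sketch below assumes small n and k so that exhaustive enumeration is feasible; the function names are hypothetical.

```python
import itertools
import math


def pattern(seq):
    """Indices in order of first occurrence (1-based), as a tuple."""
    index_of = {}
    return tuple(index_of.setdefault(s, len(index_of) + 1) for s in seq)


def pattern_prob_by_sequences(theta, psi):
    """Sum P(y^n) over all sequences y^n whose pattern is psi."""
    k, n = len(theta), len(psi)
    total = 0.0
    for y in itertools.product(range(k), repeat=n):
        if pattern(y) == tuple(psi):
            total += math.prod(theta[s] for s in y)
    return total


def pattern_prob_by_permutations(theta, psi):
    """Fix the pattern and sum over all ways to assign distinct
    letters to its m indices (k!/(k-m)! terms)."""
    k, m = len(theta), max(psi)
    total = 0.0
    for letters in itertools.permutations(range(k), m):
        total += math.prod(theta[letters[j - 1]] for j in psi)
    return total


if __name__ == "__main__":
    theta = (0.4, 0.1, 0.2, 0.3)
    psi = (1, 2, 1, 3)
    print(pattern_prob_by_sequences(theta, psi),
          pattern_prob_by_permutations(theta, psi))
```

The second function visits only k!/(k-m)! terms instead of k^n sequences, which mirrors the remark that only the components of the permutation for occurring symbols matter.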

$$H_\theta(\Psi^n) \triangleq -\sum_{\psi^n} P_\theta(\psi^n)\log P_\theta(\psi^n). \tag{4}$$
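For very small n and k, the pattern block entropy can be computed exactly by enumerating all sequences and accumulating probability per pattern. The following sketch does this and is meant only as a sanity check against the simple bounds in (1); it is exponential in n and the names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def pattern(seq):
    """Indices in order of first occurrence (1-based), as a tuple."""
    index_of = {}
    return tuple(index_of.setdefault(s, len(index_of) + 1) for s in seq)


def pattern_entropy_bits(theta, n):
    """Exact H(Psi^n) in bits by exhaustive enumeration (k**n sequences)."""
    probs = defaultdict(float)
    for y in itertools.product(range(len(theta)), repeat=n):
        probs[pattern(y)] += math.prod(theta[s] for s in y)
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    theta, n = (0.4, 0.1, 0.2, 0.3), 6
    block = -n * sum(p * math.log2(p) for p in theta)
    print(pattern_entropy_bits(theta, n), "<=", block)
```

For a uniform binary source every pattern corresponds to exactly two sequences, so the pattern entropy is n - 1 bits, which meets the lower bound of (1) with equality.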



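To derive bounds on the pattern entropy, we define three different grids: $\boldsymbol\tau$, $\boldsymbol\eta$, and $\boldsymbol\xi$, the first two for upper bounding and the third for lower bounding. Spacing between grid points is motivated by the fact that two probability parameters $\theta$ and $\theta'$ separated by $O\left(\sqrt{\theta}/\sqrt{n}^{\,1+\delta}\right)$, $\delta > 0$, are near enough to appear similar in $x^n$. On the other hand, if $|\theta - \theta'| \ge \sqrt{\theta}/\sqrt{n}^{\,1-\delta}$, the parameters are far enough to appear different. For simplicity of notation, we omit the dependence on $n$ from definitions of grid points. For $\varepsilon > 0$, let $\boldsymbol\tau = (\tau_0, \tau_1, \tau_2, \ldots, \tau_b, \ldots, \tau_{B_\tau})$ be a grid of $B_\tau + 1$ points defined by $\tau_0 = 0$, and

$$\tau_b = \sum_{j=1}^{b}\frac{2\left(j - \frac{1}{2}\right)}{n^{1+\varepsilon}} = \frac{b^2}{n^{1+\varepsilon}}, \qquad b = 1, 2, \ldots, B_\tau. \tag{5}$$

Let $\boldsymbol\eta'$ be defined almost like $\boldsymbol\tau$,

$$\eta'_b = \sum_{j=1}^{b}\frac{2\left(j - \frac{1}{2}\right)}{n^{1+2\varepsilon}} = \frac{b^2}{n^{1+2\varepsilon}}. \tag{6}$$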



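The grid $\boldsymbol\eta = (\eta_0, \eta_1, \ldots, \eta_{B_\eta})$ is defined by $\eta_0 = 0$, $\eta_1 = 1/n^{1+\varepsilon}$, $\eta_2 = 1/n^{1-\varepsilon}$, and

$$\eta_b = \eta'_{\,b + \lfloor n^{3\varepsilon/2} \rfloor - 2}, \qquad b = 3, 4, \ldots, B_\eta. \tag{7}$$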

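Unlike $\boldsymbol\tau$ and $\boldsymbol\eta$, $\boldsymbol\xi = (\xi_0, \xi_1, \ldots, \xi_{B_\xi})$ is defined for lower bound purposes. It is defined in a similar manner as the others, where $\xi_0 = 0$, and for an arbitrarily small $\varepsilon > 0$,

$$\xi_b = \sum_{j=1}^{b}\frac{2\left(j - \frac{1}{2}\right)}{n^{1-\varepsilon}} = \frac{b^2}{n^{1-\varepsilon}}, \qquad b = 1, 2, \ldots, B_\xi. \tag{8}$$

For all grids, $\tau_{B_\tau+1} = \eta_{B_\eta+1} = \xi_{B_\xi+1} = 1$. We thus have $B_\tau = \lfloor \sqrt{n}^{\,1+\varepsilon} \rfloor$, $B_\eta = \lfloor \sqrt{n}^{\,1+2\varepsilon} \rfloor - \lfloor n^{3\varepsilon/2} \rfloor + 2$, and $B_\xi = \lfloor \sqrt{n}^{\,1-\varepsilon} \rfloor$. We also define the maximal indices $A_\tau$, $A_\eta$, and $A_\xi$ whose grid points do not exceed 0.5 for $\boldsymbol\tau$, $\boldsymbol\eta$, and $\boldsymbol\xi$, respectively. Hence, $A_\tau = \lfloor \sqrt{n}^{\,1+\varepsilon}/\sqrt{2} \rfloor$, $A_\eta = \lfloor \sqrt{n}^{\,1+2\varepsilon}/\sqrt{2} \rfloor - \lfloor n^{3\varepsilon/2} \rfloor + 2$, and $A_\xi = \lfloor \sqrt{n}^{\,1-\varepsilon}/\sqrt{2} \rfloor$.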

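By definition of $\boldsymbol\eta$, for every $\theta \in [\eta_b, \eta_{b+1}]$ where $b \ge 2$,

$$\eta_{b+1} - \eta_b = \frac{2\left[(b+d) + 0.5\right]}{n^{1+2\varepsilon}} \le \frac{3(b+d)}{n^{1+2\varepsilon}} = \frac{3\sqrt{\eta_b}}{\sqrt{n}^{\,1+2\varepsilon}} \le \frac{3\sqrt{\theta}}{\sqrt{n}^{\,1+2\varepsilon}}, \tag{9}$$

where $d = \lfloor n^{3\varepsilon/2} \rfloor - 2$, so that $\eta_b = (b+d)^2/n^{1+2\varepsilon}$ by (6) and (7). A similar bound applies to $\tau_b$, $b \ge 1$, with $\varepsilon$ in place of $2\varepsilon$. Similarly,

$$\xi_{b+1} - \xi_b = \frac{2(b + 0.5)}{n^{1-\varepsilon}}. \tag{10}$$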
We use $c_b$, $b = 0, 1, \ldots, B_\tau$; $k_b$, $b = 0, 1, \ldots, B_\eta$; and $\kappa_b$, $b = 0, 1, \ldots, B_\xi$, to denote the number of symbols for which $\theta_i \in (\tau_b, \tau_{b+1}]$, $\theta_i \in (\eta_b, \eta_{b+1}]$, and $\theta_i \in (\xi_b, \xi_{b+1}]$, respectively. Respective vectors containing all components are denoted by $\mathbf{c}$, $\mathbf{k}$, and $\boldsymbol\kappa$. In addition, define $\kappa'_b$, $b = 1, 2, \ldots, B_\xi$, as zero if $\kappa_b$ is zero, and otherwise, as the number of symbols for which $\theta_i \in (\xi_{b-1}, \xi_{b+2}]$, with the exception of $\kappa'_1$, which will only count letters for which $\theta_i \in (\xi_1, \xi_3]$. (There is clearly an overlap between adjacent counters in $\boldsymbol\kappa'$, which is needed for derivation of a lower bound.)
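Counting letters per bin is a simple binary search once a grid is fixed. The sketch below assumes a grid with points of the form b^2 / n^exponent (for example exponent = 1 + epsilon) followed by a final point at 1; the exact grids used for the bounds may differ in their first few points, and the function names are hypothetical.

```python
import bisect
from collections import Counter


def quadratic_grid(n, exponent):
    """Grid points b*b / n**exponent for b = 0, 1, ... below 1, then 1.0."""
    scale = float(n) ** exponent
    grid = []
    b = 0
    while b * b < scale:
        grid.append(b * b / scale)
        b += 1
    grid.append(1.0)
    return grid


def bin_counts(theta, grid):
    """Count letters per bin, where bin b is the half-open interval
    (grid[b], grid[b + 1]]. Assumes every probability is in (0, 1]."""
    counts = Counter()
    for p in theta:
        counts[bisect.bisect_left(grid, p) - 1] += 1
    return counts


if __name__ == "__main__":
    grid = quadratic_grid(100, 1.0)
    print(bin_counts((0.005, 0.01, 0.02, 0.3, 0.665), grid))
```

Using `bisect_left` places a probability that coincides with a grid point into the bin below it, matching the right-closed intervals used for the counters.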

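The grid $\boldsymbol\tau$ is defined so that all letters $\theta_i \le 1/n^{1+\varepsilon}$ are grouped in the same bin. Grid $\boldsymbol\eta$ also groups probabilities in $(1/n^{1+\varepsilon}, 1/n^{1-\varepsilon}]$ in bin 1. In particular, $k_0$ and $k_1$ denote the symbol counts of the two groups, respectively. We will also use $k_{01} = k_0 + k_1$ to denote the total letters with $\theta_i \le 1/n^{1-\varepsilon}$ (thus $k - k_{01}$ denotes the count of symbols with $\theta_i > 1/n^{1-\varepsilon}$). Let

$$\varphi_b = \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} \theta_i \tag{11}$$

be the total probability of letters in bin $b$ of grid $\boldsymbol\eta$. Of particular importance will be $\varphi_0$, $\varphi_1$, defined with respect to (w.r.t.) bins 0, 1, respectively, and $\varphi_{01} = \varphi_0 + \varphi_1$. Define $\ell_0$, $\ell_1$, and $\ell_{01}$, where $\ell_b = \min\{k_b, n\}$.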

The probability that letter $i$ does not occur in $X^n$ is

$$P_\theta(i \notin X^n) = (1 - \theta_i)^n. \tag{12}$$

Taking an exponent of the logarithm of (12), using Taylor series expansion in the exponent,

$$e^{-n(\theta_i + \theta_i^2)} \le P_\theta(i \notin X^n) \le e^{-n\theta_i}, \qquad \text{if } \theta_i \le 3/5. \tag{13}$$

If $\theta_i > 3/5$, the upper bound is the same, but the lower bound is 0. Following (13),

$$1 - e^{-n\theta_i} \le P_\theta(i \in X^n) \le 1 - e^{-n(\theta_i + \theta_i^2)}, \tag{14}$$

where the upper bound is replaced by 1 for $\theta_i > 3/5$.
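The exponential bounds on the non-occurrence probability are easy to check numerically. The sketch below compares (1 - theta)^n with exp(-n(theta + theta^2)) and exp(-n theta), using 0 as the lower bound above 3/5 as described in the text; the function name is hypothetical.

```python
import math


def non_occurrence_bounds(theta_i, n):
    """Return (lower, exact, upper) for the probability that a letter
    with probability theta_i never occurs in n i.i.d. draws."""
    exact = (1.0 - theta_i) ** n
    upper = math.exp(-n * theta_i)
    if theta_i <= 3 / 5:
        lower = math.exp(-n * (theta_i + theta_i ** 2))
    else:
        lower = 0.0  # the exponential lower bound is not used here
    return lower, exact, upper


if __name__ == "__main__":
    for t in (0.001, 0.01, 0.1, 0.5, 0.6, 0.9):
        print(t, non_occurrence_bounds(t, 10))
```

The lower bound rests on the inequality -ln(1 - t) <= t + t^2, which holds on the whole range t <= 3/5 used in the text.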

The mean number of occurrences of letter $i$ in $X^n$ is given by $E_\theta N_x(i) = n\theta_i$, where $n_x(i)$ is the number of occurrences of $i$ in $x^n$, $N_x(i)$ is its random variable, and $E_\theta$ is expectation given $\boldsymbol\theta$. Then, the mean number of re-occurrences (beyond the first occurrence) of letter $i$ in $X^n$ is given by

$$E_\theta N_x(i) - P_\theta(i \in X^n) = n\theta_i - 1 + (1 - \theta_i)^n. \tag{15}$$

It is thus bounded by

$$n\theta_i - 1 + e^{-n(\theta_i + \theta_i^2)} \le E_\theta N_x(i) - P_\theta(i \in X^n) \le n\theta_i - 1 + e^{-n\theta_i}, \tag{16}$$

where, again, the last term of the lower bound is replaced by 0 for $\theta_i > 3/5$. Using the Binomial expansion on (12), the probability of an occurrence of letter $i$ for $\theta_i \le 1/n$ can be bounded by

$$n\theta_i - \binom{n}{2}\theta_i^2 \le P_\theta(i \in X^n) \le n\theta_i - \binom{n}{2}\theta_i^2 + \binom{n}{3}\theta_i^3, \tag{17}$$

and then the mean number of re-occurrences of letter $i$ is bounded by

$$\binom{n}{2}\theta_i^2 - \binom{n}{3}\theta_i^3 \le E_\theta N_x(i) - P_\theta(i \in X^n) \le \binom{n}{2}\theta_i^2. \tag{18}$$
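A small numerical check of the re-occurrence bounds may help. The sketch below computes the exact mean number of re-occurrences n*theta - 1 + (1 - theta)^n and compares it with the exponential bounds and with the binomial-expansion bounds for theta <= 1/n; function names are hypothetical.

```python
import math


def mean_reoccurrences(theta_i, n):
    """Exact mean number of occurrences beyond the first one."""
    return n * theta_i - 1.0 + (1.0 - theta_i) ** n


def exponential_bounds(theta_i, n):
    """(lower, upper) from the exponential bounds; assumes theta_i <= 3/5."""
    lower = n * theta_i - 1.0 + math.exp(-n * (theta_i + theta_i ** 2))
    upper = n * theta_i - 1.0 + math.exp(-n * theta_i)
    return lower, upper


def binomial_bounds(theta_i, n):
    """(lower, upper) from truncating the binomial expansion."""
    c2 = math.comb(n, 2) * theta_i ** 2
    c3 = math.comb(n, 3) * theta_i ** 3
    return c2 - c3, c2


if __name__ == "__main__":
    n, t = 50, 0.01
    print(exponential_bounds(t, n), mean_reoccurrences(t, n),
          binomial_bounds(t, n))
```

For probabilities far below 1/n the exact expression suffers from floating-point cancellation, which is one practical reason to prefer the binomial-expansion form in that regime.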

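Let $K_b$; $C_b$ denote random variables counting the distinct symbols from bin $b$ of $\boldsymbol\eta$; $\boldsymbol\tau$, respectively, that occur in $X^n$. Let $K$ denote the total number of distinct letters occurring in $X^n$. The mean number of distinct letters from bin $b$ of $\boldsymbol\eta$ that occur in $X^n$ is

$$L_b \triangleq E_\theta[K_b] = \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} P_\theta(i \in X^n) = k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} (1 - \theta_i)^n, \tag{19}$$

where $L_0$, $L_1$, $L_{01}$ are of specific interest, and also $L = E_\theta[K]$ is computed in a similar manner. As in (14),

$$k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}]} e^{-n\theta_i} \le L_b \le k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}],\; \theta_i \le 3/5} e^{-n(\theta_i + \theta_i^2)}. \tag{20}$$

In particular, for bin 0, as in (17),

$$n\varphi_0 - \binom{n}{2}\sum_{i=1}^{k_0}\theta_i^2 \le L_0 \le n\varphi_0 - \binom{n}{2}\sum_{i=1}^{k_0}\theta_i^2 + \binom{n}{3}\sum_{i=1}^{k_0}\theta_i^3. \tag{21}$$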
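The mean number of distinct letters follows from linearity of expectation, which the sketch below checks against exhaustive enumeration on a toy source. It treats the whole alphabet as a single bin; restricting the sum to the letters of one bin gives the per-bin mean. Function names are hypothetical.

```python
import itertools
import math


def expected_distinct(theta, n):
    """Sum over letters of the probability that the letter occurs."""
    return sum(1.0 - (1.0 - p) ** n for p in theta)


def expected_distinct_brute(theta, n):
    """Same quantity by enumerating all k**n sequences."""
    total = 0.0
    for y in itertools.product(range(len(theta)), repeat=n):
        total += len(set(y)) * math.prod(theta[s] for s in y)
    return total


if __name__ == "__main__":
    theta = (0.4, 0.1, 0.2, 0.3)
    print(expected_distinct(theta, 5), expected_distinct_brute(theta, 5))
```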

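Packing lower bin(s) into single point masses, we can thus define,

$$H^{(0)}_\theta(X) \triangleq -\varphi_0\log\varphi_0 - \sum_{i=k_0+1}^{k}\theta_i\log\theta_i, \tag{22}$$

$$H^{(01)}_\theta(X) \triangleq -\varphi_{01}\log\varphi_{01} - \sum_{i=k_{01}+1}^{k}\theta_i\log\theta_i, \tag{23}$$

$$H^{(0,1)}_\theta(X) \triangleq -\sum_{b=0}^{1}\varphi_b\log\varphi_b - \sum_{i=k_{01}+1}^{k}\theta_i\log\theta_i. \tag{24}$$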
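The three packed entropies can be computed directly from a probability vector once the two thresholds are fixed. The sketch below assumes the thresholds 1/n^(1+eps) and 1/n^(1-eps) for bins 0 and 1; the function name and the test distribution are illustrative only.

```python
import math


def packed_entropies(theta, n, eps):
    """Return (H0, H01, H0_1, H) in bits: bin 0 packed, bins 0 and 1
    packed together, bins 0 and 1 packed separately, and the plain entropy."""
    low, high = n ** -(1 + eps), n ** -(1 - eps)

    def plogp(p):
        return -p * math.log2(p) if p > 0 else 0.0

    bin0 = [p for p in theta if p <= low]
    bin1 = [p for p in theta if low < p <= high]
    large = [p for p in theta if p > high]
    phi0, phi1 = sum(bin0), sum(bin1)
    tail = sum(plogp(p) for p in large)

    h0 = plogp(phi0) + sum(plogp(p) for p in bin1) + tail
    h01 = plogp(phi0 + phi1) + tail
    h0_1 = plogp(phi0) + plogp(phi1) + tail
    h = sum(plogp(p) for p in theta)
    return h0, h01, h0_1, h


if __name__ == "__main__":
    theta = (0.002, 0.003, 0.01, 0.02, 0.3, 0.665)
    print(packed_entropies(theta, n=100, eps=0.2))
```

Since packing letters into a point mass is a deterministic merge of symbols, the values are ordered: packing both bins together gives the smallest entropy and the unpacked source the largest.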



The expressions in (22)-(24) will be used to express some of the bounds in the paper, where low probability letters are packed into one or two point masses. (These expressions also depend on the choice of $\varepsilon$. This dependence is omitted for convenience.)
3 The Main Results



The main results in the paper are summarized below. First, if $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, the pattern entropy is bounded by

$$nH_\theta(X) - \sum_{b=1}^{A_\xi}\log(\kappa_b!) - k\log 3 - o(1) \le H_\theta(\Psi^n) \le nH_\theta(X) - (1-\varepsilon)\sum_{b=2}^{A_\eta}\log(k_b!) + o(k). \tag{25}$$

Namely, the pattern entropy decreases to first order from the i.i.d. block entropy by the logarithm of the product of permutations within all the bins of the probability space. The bounds in (25) depend on the arrangement of the letters in the probability space. However, even if we only know the number of letters in the alphabet, we can still bound the range that the pattern entropy can be in. The actual point in this range does depend on the arrangement of the letters in the probability space. However, if the alphabet is large enough, the pattern entropy must decrease w.r.t. the i.i.d. one regardless of this arrangement. In all, if $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, we have

nHg{X) - log {kl)< He {^^) , ' (26) 

[ nHg{X)-lklog^^^rfWU2, ifA:>ni/3+^ 

The bound above shows that the decrease in the pattern entropy w.r.t. the i.i.d. one for large alphabets is to first order between $\log k$ bits and $\log\left(k^{3/2}/\sqrt{n}\right)$ bits for each alphabet letter.

If the alphabet contains letters with low probabilities, namely, with $\theta_i \le 1/n^{1-\varepsilon}$ ($k_{01} > 0$), the pattern entropy is upper bounded by

$$\begin{aligned} H_\theta(\Psi^n) \le{}& nH^{(0,1)}_\theta(X) - \sum_{b=2}^{A_\eta}(1-\varepsilon)\log(k_b!) + (n\varphi_1 - L_1)\log\left[\min\{k_1, n\}\right] + n\varphi_1 h_2\!\left(\frac{L_1}{n\varphi_1}\right) \\ &+ \frac{n^2}{2}\sum_{i=1}^{k_0}\theta_i^2\,\log\frac{2e\cdot\varphi_0\cdot\min\{k_0, n\}}{n\sum_{j=1}^{k_0}\theta_j^2}, \end{aligned} \tag{27}$$

where $h_2(\alpha) = -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)$. The third and fourth terms contribute at most $O(n\varphi_1\log n)$, and the last term $o(n)$. The bound in (27) implies that the source appears to contain a single letter for bin 0 and another single letter for bin 1, and its entropy decreases, again, by the logarithm of the number of permutations leading to typical sequences w.r.t. all other bins. In addition, there is a limited penalty reflected in the last three terms for packing all letters in bins 1 and 0 as two point masses. This penalty is higher for bin 1, which is the boundary between two different asymptotic behaviors. For non-diminishing i.i.d. entropies $H^{(0,1)}_\theta(X)$ the penalty of packing all letters in bin 0 into one point mass is negligible.



A lower bound of a similar nature is then obtained, showing that the pattern entropy satisfies

$$\begin{aligned} H_\theta(\Psi^n) \ge{}& nH^{(01)}_\theta(X) - \sum_{b=1}^{A_\xi}\log(\kappa_b!) - (k - \kappa_0)\log 3 \\ &+ \sum_{i=1}^{k_{01}-1}\left[n\theta_i - 1 + e^{-n(\theta_i + \theta_i^2)}\right]\log\frac{\varphi_{01}}{\theta_i} + \left(n\theta_{k_{01}} - 1\right)\log\frac{\varphi_{01}}{\theta_{k_{01}}} \\ &+ (\log e)\sum_{i=1}^{L_{01}-1}(L_{01} - i)\frac{\theta_i}{\varphi_{01}} - \log\binom{k_\vartheta^- + k_\vartheta^+}{k_\vartheta^+} - o(1), \end{aligned} \tag{28}$$

where $k_\vartheta^-$ denotes the number of letters with $\theta_i \in \left(\vartheta^-/n^{1-\varepsilon}, 1/n^{1-\varepsilon}\right]$ and $k_\vartheta^+$ the number of letters with $\theta_i \in \left(1/n^{1-\varepsilon}, \vartheta^+/n^{1-\varepsilon}\right]$, and $\vartheta^-$ and $\vartheta^+$ are constants, such that $\vartheta^+ > 1 > \vartheta^- > 0$. This bound illustrates similar behavior to that in (27), where the pattern entropy behaves like that of a source for which the low probabilities in bins 0 and 1 are packed into one point mass, and a similar behavior to that in (25) is shown for greater probabilities. Packing of bins 0 and 1 results in correction terms reflecting the increase in entropy due to repetitions and first occurrences, and another correction term (the seventh term) reflecting the unclear boundary between two different asymptotic behaviors. For many sources, variations of the last two bounds are very close to each other and lead to very accurate approximations of the pattern entropy.



4 Bounds for Small and Large Alphabets 

This section studies pattern entropy with bounded letter probabilities $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$ (i.e., $k_{01} = 0$). Upper and lower bounds are presented.

4.1 An Upper Bound 

Theorem 1 Fix $\delta > 0$. Let $n \to \infty$ and $\varepsilon \ge (1+\delta)(\ln\ln n)/(\ln n)$. If $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, $1 \le i \le k$,

$$H_\theta(\Psi^n) \le nH_\theta(X) - (1-\varepsilon)\sum_{b=2}^{A_\eta}\log(k_b!) + o(k). \tag{29}$$

The bound can be tightened by substituting $\varepsilon$ in the second term by $\exp\{-[0.1n^{\varepsilon} - 2\ln n]\}$. The grid $\boldsymbol\eta$, which is used for the proof, is defined with the same $\varepsilon$. Theorem 1 shows that letters whose probabilities are in the same bin of $\boldsymbol\eta$ can be exchanged in a typical $x^n$, generating sequences $x'^n$ with $P_\theta(x'^n) \approx P_\theta(x^n)$ and $\Psi(x'^n) = \Psi(x^n)$. This increases $P_\theta[\Psi(x^n)]$ by a factor of the total of such possible permutations, and decreases the entropy by its logarithm. Summation in the second term of (29) is only up to $A_\eta$ because larger index bins contain at most a single symbol probability.

Proof: The proof separates typical from unlikely (untypical) $x^n$. Then, $P_\theta(\psi^n)$ is lower bounded by the sum of $P_\theta(x^n)$ over typical $x^n$ with $\Psi(x^n) = \psi^n$. For all such $x^n$, $P_\theta(x^n)$ is almost equal. The number of such sequences results in the entropy decrease and is equal to the number of possible permutations within the bins of $\boldsymbol\eta$.

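We define a typical set. Let $\hat{\boldsymbol\theta}$ be the Maximum Likelihood (ML) estimator of $\boldsymbol\theta$ from $x^n$. Then,

$$\mathcal{T}_x \triangleq \left\{x^n : \forall i,\; \left|\hat\theta_i - \theta_i\right| < \frac{\sqrt{\theta_i}}{2\sqrt{n}^{\,1-\varepsilon}}\right\}, \tag{30}$$

$$\bar{\mathcal{T}}_x \triangleq \left\{x^n : \exists i,\; \left|\hat\theta_i - \theta_i\right| \ge \frac{\sqrt{\theta_i}}{2\sqrt{n}^{\,1-\varepsilon}}\right\}. \tag{31}$$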
(31) 



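Lemma 4.1

$$P_\theta\left(\bar{\mathcal{T}}_x\right) \le \exp\left\{-0.1n^{\varepsilon} + (2 - \varepsilon)\ln n\right\} \triangleq \varepsilon_n. \tag{32}$$

The proof of Lemma 4.1 is in Appendix A. For the choice of $\varepsilon$ in Theorem 1, $\varepsilon_n \to 0$.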



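Now, define $\mathcal{S}$ as the set of all permutations $\boldsymbol\sigma$ that permute symbols only within bins of $\boldsymbol\eta$, i.e.,

$$\mathcal{S} \triangleq \left\{\boldsymbol\sigma : \theta_i \in (\eta_b, \eta_{b+1}] \Rightarrow \theta(\sigma_i) \in (\eta_b, \eta_{b+1}],\; \forall i = 1, 2, \ldots, k\right\}. \tag{33}$$

The definition of $\mathcal{S}$ is independent of $x^n$, and depends only on $\boldsymbol\theta$.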



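Lemma 4.2 Let $x^n \in \mathcal{T}_x$ and $\boldsymbol\sigma \in \mathcal{S}$. Then,

$$\left|\ln\frac{P_\theta(x^n)}{P_{\theta(\boldsymbol\sigma)}(x^n)}\right| \le o(k). \tag{34}$$

Lemma 4.2 shows that the probability of a typical $x^n$ given by a permuted parameter vector in $\mathcal{S}$ diverges only by a negligible factor from $P_\theta(x^n)$. Its proof is in Appendix B.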



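Let $M_{\theta,\eta}$ be the number of permutation vectors $\boldsymbol\sigma$ in $\mathcal{S}$. Using (34), for $x^n \in \mathcal{T}_x$,

$$\log P_\theta\left[\Psi(x^n)\right] \ge \log P_\theta(x^n) + \log M_{\theta,\eta} - o(k), \tag{35}$$

$$M_{\theta,\eta} = |\mathcal{S}| = \prod_{b=2}^{B_\eta} k_b! = \prod_{b=2}^{A_\eta} k_b!. \tag{36}$$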
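The count of permutations that move letters only within their own bin equals the product of the factorials of the bin occupancies. The brute-force sketch below confirms this on a toy assignment of letters to bins; the bin labels and function names are illustrative assumptions.

```python
import itertools
import math
from collections import Counter


def count_within_bin_permutations(bin_of):
    """Count permutations sigma of the letters such that every letter i
    is mapped to a letter sigma[i] lying in the same bin."""
    k = len(bin_of)
    return sum(
        1
        for sigma in itertools.permutations(range(k))
        if all(bin_of[sigma[i]] == bin_of[i] for i in range(k))
    )


def product_of_factorials(bin_of):
    """Closed form: product over bins of (bin occupancy)!."""
    return math.prod(math.factorial(c) for c in Counter(bin_of).values())


if __name__ == "__main__":
    bin_of = [2, 2, 2, 3, 3, 4]   # bin index of each of six letters
    m = count_within_bin_permutations(bin_of)
    print(m, product_of_factorials(bin_of), math.log2(m))
```

The base-2 logarithm of this count is the first-order entropy decrease that the theorem attributes to exchangeable letters.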

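Hence, applying (35) and Lemma 4.1 (step (a) below),

$$\begin{aligned} H_\theta(\Psi^n) &= -\sum_{x^n \in \mathcal{T}_x} P_\theta(x^n)\log P_\theta\left[\Psi(x^n)\right] - \sum_{x^n \in \bar{\mathcal{T}}_x} P_\theta(x^n)\log P_\theta\left[\Psi(x^n)\right] \\ &\stackrel{(a)}{\le} H_\theta(X^n) - (1 - \varepsilon_n)\log M_{\theta,\eta} + o(k) \\ &= nH_\theta(X) - (1 - \varepsilon_n)\sum_{b=2}^{A_\eta}\log(k_b!) + o(k). \end{aligned} \tag{37}$$

The proof of Theorem 1 is concluded. □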



4.2 Lower Bounds 

Theorem 2 Fix $\delta > 0$. Let $n \to \infty$ and $\varepsilon \ge (1+\delta)(\ln\ln n)/(\ln n)$. If $\theta_i > 1/n^{1-\varepsilon}$, $\forall i$, $1 \le i \le k$,

$$H_\theta(\Psi^n) \ge nH_\theta(X) - \sum_{b=1}^{A_\xi}\log(\kappa_b!) - k\log 3 - o(1), \tag{38}$$

and also

$$H_\theta(\Psi^n) \ge nH_\theta(X) - \sum_{b=1}^{A_\xi}\log(\kappa'_b!) - o(1). \tag{39}$$

The two bounds above are very close and, except for one step, are proved similarly. The bound of (38) does not count occurrences in a given bin more than once (except the correction term of $k\log 3$). However, there exist distributions, such as the geometric distribution (see, e.g., [13]), where components of $\boldsymbol\kappa$ sparsely populate bins, for which the bound of (39) will be tighter. The last term of $o(1)$ decays at an exponential rate of $O(\varepsilon_n n\log n)$, where $\varepsilon_n$ is defined in (32). The pattern entropy is shown to decrease by the logarithm of the number of permutations within bins of $\boldsymbol\xi$.

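Proof: Define the set of typical patterns as

$$\mathcal{T}_\psi \triangleq \left\{\psi^n : \exists x^n \in \mathcal{T}_x,\; \psi^n = \Psi(x^n)\right\}, \tag{40}$$

the set of patterns, each of at least one typical sequence as defined in (30). Now, for $\psi^n \in \mathcal{T}_\psi$, let $M_{\theta,\xi}(\psi^n) \triangleq \left|\left\{y^n \in \mathcal{T}_x : \psi^n = \Psi(y^n)\right\}\right|$ be the number of typical sequences that have the pattern $\psi^n$, and let $\bar M_{\theta,\xi}$ and $\bar M'_{\theta,\xi}$ denote upper bounds on $M_{\theta,\xi}(\psi^n)$ for $\psi^n \in \mathcal{T}_\psi$.

Lemma 4.3 Let $\psi^n \in \mathcal{T}_\psi$. Then,

$$M_{\theta,\xi}(\psi^n) \le \bar M_{\theta,\xi} \triangleq 3^k\prod_{b=1}^{A_\xi}\kappa_b!, \tag{41}$$

$$M_{\theta,\xi}(\psi^n) \le \bar M'_{\theta,\xi} \triangleq \prod_{b=1}^{A_\xi}\kappa'_b!. \tag{42}$$

The proof of Lemma 4.3 is in Appendix C. It now follows that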

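$$\begin{aligned} H_\theta(\Psi^n) &= H_\theta(X^n) - H_\theta(X^n \mid \Psi^n) \\ &\stackrel{(a)}{\ge} H_\theta(X^n) - P_\theta(\mathcal{T}_x)H_\theta(X^n \mid \Psi^n, \mathcal{T}_x) - P_\theta(\bar{\mathcal{T}}_x)H_\theta(X^n \mid \Psi^n, \bar{\mathcal{T}}_x) - h_2\left[P_\theta(\mathcal{T}_x)\right] \\ &\stackrel{(b)}{\ge} H_\theta(X^n) - \log\bar M_{\theta,\xi} - \varepsilon_n\log k! - o(n\varepsilon_n) \\ &= nH_\theta(X) - \log\bar M_{\theta,\xi} - o(1), \end{aligned} \tag{43}$$

where (a) follows from the chain rule, namely,

$$\begin{aligned} H_\theta(X^n \mid \Psi^n) &= H_\theta(X^n, T \mid \Psi^n) = H_\theta(X^n \mid T, \Psi^n) + H_\theta\left[T \mid \Psi^n\right] \\ &= P_\theta(\mathcal{T}_x)H_\theta(X^n \mid \Psi^n, \mathcal{T}_x) + P_\theta(\bar{\mathcal{T}}_x)H_\theta(X^n \mid \Psi^n, \bar{\mathcal{T}}_x) + H_\theta\left[T \mid \Psi^n\right], \end{aligned} \tag{44}$$

where $T$ is a Bernoulli random variable, taking value 1 if $\mathcal{T}_x$ occurs, and the last term of step (a) of (43) follows since conditioning reduces entropy. Step (b) of (43) follows from $P_\theta(\mathcal{T}_x) \le 1$, $H_\theta(X^n \mid \Psi^n, \mathcal{T}_x) \le \log\bar M_{\theta,\xi}$, Lemma 4.1, $H_\theta(X^n \mid \Psi^n, \bar{\mathcal{T}}_x) \le \log k!$, and from $h_2[P_\theta(\mathcal{T}_x)] = o(n\varepsilon_n)$. Substituting $\bar M_{\theta,\xi}$ from (41) in (43) yields the bound of (38). Similarly, using $\bar M'_{\theta,\xi}$ from (42) yields (39). □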



5 Bounds for Very Large Alphabets 

The more general case is now considered, where there exist alphabet letters with very small probabilities that may not occur in $x^n$. Specifically, the effect of such letters on $H_\theta(\Psi^n)$ is considered.

5.1 Upper Bounds

General upper bounds are derived by designing a low-complexity (non-universal) sequential probability assignment method for $\psi^n$, whose average description length serves as an upper bound on $H_\theta(\Psi^n)$. Instead of coding $\psi^n$ by itself, the pair $(\psi^n, \beta^n)$ is jointly coded, where $\beta^n$ represents the sequence of bins corresponding to letters in $x^n$. Different grids produce different bounds. Examples and study of pattern entropy for specific distributions in [13] demonstrate that tightness depends on the specific source distribution. One bound may be tighter for one distribution and another for another.

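Theorem 3 Fix $\delta > 0$. Let $n \to \infty$ and $\varepsilon \ge (1+\delta)(\ln\ln n)/(\ln n)$ (also used to define $\boldsymbol\eta$). Then,

$$\begin{aligned} H_\theta(\Psi^n) \le{}& nH^{(0,1)}_\theta(X) - (1-\varepsilon)\sum_{b=2}^{A_\eta}\log(k_b!) + (n\varphi_1 - L_1)\log\left[\min\{k_1, n\}\right] + n\varphi_1 h_2\!\left(\frac{L_1}{n\varphi_1}\right) \\ &+ \frac{n^2}{2}\sum_{i=1}^{k_0}\theta_i^2\,\log\frac{2e\cdot\varphi_0\cdot\min\{k_0, n\}}{n\sum_{j=1}^{k_0}\theta_j^2}. \end{aligned} \tag{45}$$

The bound in (45) consists of four major components: 1) the i.i.d. entropy in which bins 0 and 1 of $\boldsymbol\eta$ are each packed into a point mass (the first term), 2) the gain in first occurrences of symbols $i$ with $\theta_i > 1/n^{1-\varepsilon}$ (the second term), 3) the loss in packing bin 1 (the next two terms), and 4) the loss in packing bin 0 (the last term). The sum of the third and fourth terms in (45) decreases with $L_1$ for $k_1 \ge (1+\varepsilon)n^{\varepsilon}$, thus $L_1$ can be replaced by a lower bound as in (20).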



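If the symbols in bins 0 and 1 formed by $\boldsymbol\eta$ are packed into a single point mass, a simpler upper bound that uses $H^{(01)}_\theta$ and $\varphi_{01}$ instead of $H^{(0,1)}_\theta$, and both $\varphi_0$ and $\varphi_1$, respectively, can be obtained. Using $\boldsymbol\tau$ instead of $\boldsymbol\eta$ produces other bounds.

Corollary 1 Fix $\delta > 0$. Let $n \to \infty$ and $\varepsilon \ge (1+\delta)(\ln\ln n)/(\ln n)$ (also used to define $\boldsymbol\eta$). Then,

$$H_\theta(\Psi^n) \le nH^{(01)}_\theta(X) - (1-\varepsilon)\sum_{b=2}^{A_\eta}\log(k_b!) + (n\varphi_{01} - L_{01})\log\left[\min\{k_{01}, n\}\right] + n\varphi_{01} h_2\!\left(\frac{L_{01}}{n\varphi_{01}}\right). \tag{46}$$

Let $n \to \infty$ and $\varepsilon > 0$. Then,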



2 



6=1 m=0 ^ ° 6>1,C6>1 

The bound in (48) is in many cases the tightest but is harder to compute. It can be simplified using Stirling's approximation,

$$\sqrt{2\pi m}\left(\frac{m}{e}\right)^m \le m! \le \sqrt{2\pi m}\left(\frac{m}{e}\right)^m e^{1/(12m)}, \tag{49}$$

and Jensen's inequality, at the expense of loosening it, by replacing the inner sum in its second term by $(E_\theta[C_b])\log\{(E_\theta[C_b])/e\}$, where $E_\theta[C_b]$ is the expected distinct letter count in bin $b$ of $\boldsymbol\tau$. The bounds in (45) and (46) trade off two costs: (45) has a larger first term, while (46) pays a higher penalty in its last two terms. The better trade-off is distribution dependent. Roughly, if letters in bins 0 and 1 of $\boldsymbol\eta$ are better separated, (45) is tighter, while otherwise (46) is tighter. The bound in (47) is the simplest, but ignores gains of first occurrences of letters with large probabilities. Its best use is thus for fast decaying distributions. Both (47) and (48) may be tightened in certain cases by separating low probabilities (bin 0 of $\boldsymbol\eta$) into two or more regions (see, e.g., [13]). The next examples illustrate tradeoffs between the bounds.



Example 1: For a uniform distribution with k = ki = v} ^ parameters Qi = ^, where 

0<u<e, 



He i^'') < n log n^-" - n^-" loj 



n 



l-2u 



e in 



l-2u\ 



l-u _ ^X-u 



log - — + e (n^-^-^) . 

e ' 



with (jl5|) and ([Ml). Bound (|17|) produces only the first term. Then, with the loosened dM]), 

Hei^") < nlogn 
The last bound from (j48p is the tightest. 



(50) 

(51) 
□ 



Example 2: Let 6 consist of two sets of probabilities: kg = tpQU^'^'^; fj,> e, probabilities of l/n^^'^, 
and ki = ipiii}^'^; Q < v < e, probabilities of l/n^"'^, where tpQ + ipi = 1. Then, 

{1 — v) nipi log n — n(^o log (^o — © (n log n) , using (jl5]) and (|l8]) 
-f^e (^") < < n99i log n + n/i2 {lpq) - O {n^-"" log n) , using (gB]) (52) 

(1 — v) nipi logn — mpQ log 920 + © (n^~'' log n) , using (jTr|) . 

The bound from (46) is looser because it ignores the clear separation between bins 0 and 1. The gain ignored in (47) also slightly loosens the resulting bound. The greater $\varphi_0$ is, the smaller $H_\theta(\Psi^n)$ is from $nH_\theta(X)$. □

Example 3: For a given e > 0, let 9 consist of two sets of probabilities: k^ = ipQU^^f^; /u > e, 
probabilities of l/n^+'^j and ki = ipin^^'^; < < e, probabilities of , where ipQ + ipi = 1. 

Here, (j45p results in a bound of n/i2 (930) + O {v}^'^ log n) . A much tighter bound of G {v}~'^ log n) 
is produced by (06]). This is because the two sets here are of "small" probabilities. Looser bounds 
of (nlogn) are produced by (j47p and the loosened (j48p . with a smaller coefficient for the second. 
However, since e > for these two bounds, e < v can be used to produce similar bounds to that of 
l6|) . Such fiexibility is limited with the other bounds that have positive lower limits on e. □ 



Example 4: Let 6 consist of two sets of probabilities: /cq = ^^n^^^; ^ probabilities of l/ii}^^, 
and ki = Lfin probabilities of 1/n, where (po + fi = 1. Then, 



He < { 



^ log n + n [h2 (ifo) + ipih2 (i) + ^ log + O (n^"^ log n) , 
^logn + n/i2 (^) +0 (n^-^ logn) , 
n(^i log n — mpo log 990 + O (n^~^ log n) , 



with (05]) 
with (06]) 
with (071) 



^logn + n 



h2 i^o) + ^ log + ifi log ^ 



+ O (n-*^ log n) , with 



(53)

Again, the tightest bound is from (46), implying that all probabilities here are still "small". The next is that of (45). Unlike the other examples, (48) and (47) lead to the loosest bounds. If (48) is used with $\varepsilon = 0$, an even weaker bound with first term $0.5\varphi_1 n\log n$ will result because of the use of the upper bound of (18) for mean re-occurrence count, which is looser than that of (16). Using (16) instead for the last term of (47)-(48) yields the bound of (46) for this case. □

In Theorem 3, Corollary 1, and the examples above, contributions of small probabilities influence the pattern entropy. The next corollary shows the limits of these contributions.

Corollary 2 /. The total combined contribution of all letters with < to Hq (^"') beyond 

the term —nipQi log (/jqi of nH^^^ (X) is upper bounded by the maximum between O {in?'' log n) and 

uifQi log ^01 + (^ol?^^~^log ^ + Upoin^^^e^'"-') = O (nc^oi logn) , 

Similarly, the sum of the third and fourth term in ( [^5[ j is O (max {n^?!, n^^} log n) . 

//. The total combined contribution of all letters with Oi < \/n^^^ , for any /i > 1, beyond the term 

—uifQlogipQ of nHg^'^\x) is O (n^~^~^ log n) . Similarly, the last term of is upper bounded by 

,1-e 



log {2en^^'^) = O {jn} '^ipo log n) = a (n) . 



Corollary [2] is proved in [Appendix E It shows that the per-symbol (normalized by n) contribution 



of bin of 77 beyond a single point mass is diminishing. Furthermore, any letter with 9i < 1 

has diminishing contribution to the block entropy beyond that of the single point mass of bin 0. 

The subsection is concluded with the proof of Theorem 3 and Corollary 1.

Proof of Theorem 3 and Corollary 1: For some $x^n$, let $\psi^n = \Psi(x^n)$, and define $\beta^n = (\beta_1, \beta_2, \ldots, \beta_n)$ by $\beta_j = b$ if $\theta_{x_j} \in (\eta_b, \eta_{b+1}]$. The joint sequence $(\psi^n, \beta^n)$ is sequentially assigned probability

$$Q\left[(\psi^n, \beta^n)\right] = \prod_{j=1}^{n} Q\left[(\psi_j, \beta_j) \mid (\psi^{j-1}, \beta^{j-1})\right], \tag{54}$$

where

$$Q\left[(\psi_j, \beta_j) \mid (\psi^{j-1}, \beta^{j-1})\right] = \begin{cases} \rho_{\beta_j}, & \text{if } (\psi_j, \beta_j) \in (\psi^{j-1}, \beta^{j-1}), \\ \varphi_{\beta_j} - k_{\beta_j}\left[(\psi^{j-1}, \beta^{j-1})\right]\cdot\rho_{\beta_j}, & \text{if } \psi_j = \max\{\psi_1, \ldots, \psi_{j-1}\} + 1, \\ 0, & \text{otherwise,} \end{cases} \tag{55}$$

where $\rho_b = \varphi_b/k_b$ for $b \ge 2$, $\rho_0$ and $\rho_1$ are optimized later, and $k_{\beta_j}\left[(\psi^{j-1}, \beta^{j-1})\right]$ is the number of distinct indices that jointly occurred with bin index $\beta_j$ in $(\psi^{j-1}, \beta^{j-1})$ (e.g., if $\psi^{j-1} = 1232345$ and $\beta^{j-1} = 1222242$ then $k_{\beta_j}$ is 3 for $\beta_j = 2$, 1 for $\beta_j = 1$ and $\beta_j = 4$, and is 0 for any other value of $\beta_j$). Initially, every bin $b$ is assigned its total probability $\varphi_b$. Each new index occurring with a letter in bin $b$ is assigned the remaining probability in bin $b$ for its first occurrence. For any re-occurrence, it is assigned the average symbol probability in $b$; $\rho_b$, unless $b \le 1$, where a different (smaller) value which favors first occurrences is used for $\rho_b$. After a new occurrence of a symbol in bin $b$, $\rho_b$ is subtracted from the remaining bin probability.

Since joint entropy is not smaller than the entropy of one of the components,

$$\begin{aligned} H_\theta(\Psi^n) &\le H_\theta(\Psi^n, \beta^n) \le -E_\theta\log Q(\Psi^n, \beta^n) \\ &\stackrel{(a)}{=} -n\sum_{i=k_{01}+1}^{k}\theta_i\log\rho_b(\theta_i) - \sum_{b=2}^{B_\eta}\sum_{m=0}^{k_b}P_\theta(K_b = m)\log\frac{k_b!}{(k_b - m)!} + \sum_{b=0}^{1}R_b, \\ R_b &\triangleq -\left\{\left(n\varphi_b - E_\theta[K_b]\right)\log\rho_b + \sum_{m=0}^{\min\{n, k_b\}}P_\theta(K_b = m)\sum_{l=0}^{m-1}\log(\varphi_b - l\rho_b)\right\}, \end{aligned} \tag{56}$$

where $\rho_b(\theta_i)$ is the mean symbol probability in bin $b$, where $\theta_i \in (\eta_b, \eta_{b+1}]$. Equality (a) is obtained as follows: The first term is the coding cost of "large" probability letters. The second term describes the gain of first occurrences of these letters. The first symbol occurring in a bin is assigned probability $k_b\rho_b$ at first occurrence, the second $(k_b - 1)\rho_b$, and so on. The remaining terms $R_b$ describe similar costs for bins 0 and 1. The first element for each is the re-occurrence cost. The second is the first occurrence cost. Bounds on all terms are summarized below.



Lemma 5.1 



k "'^ n 1 

n J] eilogpbiei)<nH^°^'\x)+nY^blogVb + ^^- Yl (57) 

i=kai+l 6=0 b>2,kb>l 

Bri kt , I An 

- V V (i^6 = m) log < - V [1 - ^6 exp {-nr]b}] log (fc„!) (58) 



The optimal choice of $\rho_b$; $b = 0, 1$, is

$$\rho_b = \frac{(n\varphi_b - L_b)\,\varphi_b}{n\varphi_b\,\ell_b} = \frac{n\varphi_b - L_b}{n\cdot\min\{k_b, n\}}. \tag{59}$$

With this choice,

$$R_b \le -n\varphi_b\log\varphi_b + (n\varphi_b - L_b)\log\left[\min\{k_b, n\}\right] + n\varphi_b\cdot h_2\!\left(\frac{L_b}{n\varphi_b}\right), \qquad b = 0, 1, \tag{60}$$

which decreases with $L_b$ for $b = 0$ and also for $b = 1$ if $k_1 \ge (1+\varepsilon)n^{\varepsilon}$. Specifically,

$$R_0 \le -n\varphi_0\log\varphi_0 + \frac{n^2}{2}\sum_{i=1}^{k_0}\theta_i^2\,\log\frac{2e\cdot\varphi_0\cdot\min\{k_0, n\}}{n\sum_{j=1}^{k_0}\theta_j^2}. \tag{61}$$

Lemma 5.1 is proved in Appendix D. Summing (57), (58), (60) for $b = 1$, and (61) yields (45), where the last term of (57) and the decaying terms in (58), which decay at least as fast as $k_b\exp\{-n^{\varepsilon}\}$ each, are absorbed by the leading $\varepsilon$ of the second term in (45). The decrease of (60) with $L_1$ implies that the lower bound on $L_1$ of (20) can be used in (45) as long as $k_1 \ge (1+\varepsilon)n^{\varepsilon}$.

Corollary 1 follows from similar steps. To prove (46), bins 0 and 1 are grouped to one point mass. Then, (57) and (60) are adjusted with $H^{(01)}_\theta(X)$, $k_{01}$, $L_{01}$, and $\varphi_{01}$, and summed together with (58) to produce (46). Bound (47) is obtained by packing bin 0 of $\boldsymbol\tau$ into a point mass, but coding each "large" probability symbol as an independent bin. If, in addition, the "large" probability bins of $\boldsymbol\tau$ are coded as in proving (45), an additional gain as the left hand side of (58) w.r.t. $\boldsymbol\tau$ is achieved. Using $\boldsymbol\tau$, the denominator of the last term of (57) changes (as can be seen in (D.1)). □



5.2 Lower Bounds 

The main difficulty in deriving a general lower bound on $H_\theta(\Psi^n)$ is separating between "small"
probabilities $\theta_i \le 1/n^{1-\varepsilon}$, whose symbols $i$ may or may not occur in $X^n$, and "large" probabilities,
for which the results of Theorem 2 can be used. The key idea is to use an auxiliary Bernoulli
indicator random sequence to aid in the separation.

Theorem 4 Fix 5 > 0. Let n — > oo and e > (1 + 5) (In Inn) /(Inn), define ^ with Define 
= (Zi, Z2, . . . , Zn) by Zj = if < l/n^~^ , and 1 otherwise. Let k^ be the count of letters i 
such that 9i G {{}~ /n'^^^ ,l/n'^~^^^ and k'^ the count of letters i with Oi G {l/n'^~^ ,"0^ /n'^^^^^, where 
i}^ , i}^ are constants that satisfy 1?+ > 1 > > 0. Then, 

He {-^n > uHf^^ {X) - 5i + 52 + 53 - ^4 - o(l), (62) 

where 

Si < ^log(H!) + (A:-Ko)log3, (63) 

b=l 
^« 

5i < ^log(4!), (64) 

6=1 
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koi 

S2 = Y.Ee[N,{{}-PeiiGX^)]log^, (65) 



i=l 
fcoi-l 



S2 > 5: p.-l + e-"(^^+^?)]log^ + (nV-l)logf^, (66) 



i=l 



2 ^=0 



^ ^ i=l ' i=A:o+l 



log?i, (67) 



53>(loge) V (Loi-i)— . 



S,<log{^'ti^'y (69) 

Theorem 4 lower bounds $H_\theta(\Psi^n)$ in terms of $H_\theta^{(01)}(X)$ with several correction terms, two of which
are given more than one bound. Term $S_1$ shows the decrease in $H_\theta(\Psi^n)$ due to first occurrences
of letters $i$ with $\theta_i > 1/n^{1-\varepsilon}$. Its bounds are similar to the correction term in Theorem 2. Term
S2 is the cost of re-occurrences of letters with "small" probabilities. Separation of the last term in 
()66p from the sum is only necessary if ^^p^ > 3/5 (see and discussion following it). Equation 
(67) separates the sum of (65) into bins 0 and 1, where the additional term of (66) can be added
to tighten the bound. Term $S_3$ is the penalty in first occurrences of "small" probability symbols
beyond the single point mass they are packed to. Its bound in (68) is obtained under a worst case
assumption and may be tightened. Term $S_4$ is the correction from separation to "small" and "large"
probabilities. The last term of $-o(1)$ absorbs all the lower order terms. By proper equalities, (62)
can be brought into several other forms including forms in terms of $H_\theta(X)$ and $H_\theta^{(0,1)}(X)$.

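Proof: Using $Z^n$,

$$H_\theta(\Psi^n) = H_\theta(\Psi^n \mid Z^n) + H_\theta(Z^n) - H_\theta(Z^n \mid \Psi^n). \qquad (70)$$

By definition of $Z^n$,

$$H_\theta(Z^n) = n h_2(\varphi_{01}) = -n\varphi_{01}\log\varphi_{01} - n(1-\varphi_{01})\log(1-\varphi_{01}). \qquad (71)$$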
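As a sanity check on the decomposition, (70) and (71) can be evaluated by brute force for a tiny source. The sketch below is illustrative only: the source `theta`, the block length `n`, and the fixed `threshold` (standing in for $1/n^{1-\varepsilon}$) are assumptions chosen so that full enumeration is feasible, and all function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def pattern(seq):
    """Replace each symbol by the order of its first occurrence (1-based)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def h2(p):
    """Binary entropy function in bits."""
    return entropy({0: p, 1: 1 - p})


def decomposition(theta, n, threshold):
    """Brute-force the four entropies appearing in (70) for an i.i.d. source."""
    joint = defaultdict(float)  # joint law of (pattern, indicator sequence)
    for seq in itertools.product(range(len(theta)), repeat=n):
        prob = math.prod(theta[s] for s in seq)
        # Z_j = 1 for a "large" probability letter, 0 otherwise.
        z = tuple(int(theta[s] > threshold) for s in seq)
        joint[(pattern(seq), z)] += prob
    p_psi, p_z = defaultdict(float), defaultdict(float)
    for (psi, z), prob in joint.items():
        p_psi[psi] += prob
        p_z[z] += prob
    h_joint, h_psi, h_z = entropy(joint), entropy(p_psi), entropy(p_z)
    return {
        "H_psi": h_psi,
        "H_psi_given_z": h_joint - h_z,
        "H_z": h_z,
        "H_z_given_psi": h_joint - h_psi,
    }


if __name__ == "__main__":
    theta = [0.5, 0.3, 0.1, 0.1]  # hypothetical source, not from the paper
    terms = decomposition(theta, n=4, threshold=0.2)
    phi_01 = sum(t for t in theta if t <= 0.2)
    print(terms)
    print("n * h2(phi_01) =", 4 * h2(phi_01))
```

In this toy setting `H_z` coincides with $n h_2(\varphi_{01})$ because the indicators are i.i.d. Bernoulli, which is the content of (71); the asymptotic bounds on the individual terms are not exercised by such a small example.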



The third term is bounded in the following lemma, which is proved in Appendix F.

Lemma 5.2

$$S_4 = H_\theta(Z^n \mid \Psi^n) \le \log\binom{k_\vartheta^- + k_\vartheta^+}{k_\vartheta^+} + o(1). \qquad (72)$$
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To bound the first term of (j70p . define two new pattern sequences tp"^ and V'"- The first is defined 
as tpj = if Zj = 1, and for the second tpj = if zj = 0, where (p is a do not care symbol. The 
other components of both -0" and ■0" are the patterns of the remaining symbols in x", respectively, 
i.e., ■0" and ■0" are the patterns of low and high probability symbols occurring in x", respectively. 
In a similar manner, define x" and x", where Xj = if Zj = 1, xj = xj, otherwise, and xj = if 
Zj = 0, Xj = Xj, otherwise. Now, 

He I Z") = He | Z") + I ^'') (73) 

because up to deterministic labeling of pattern indices, the uncertainty on both sides is equal. 
Following the same steps in (jl3|) . 

He I Z") = He (x" | Z") - {Pe {% \ Z") (l" | Z", T,.) | Z"} - 

{i^e {% I I T,) I - (T I Z") 

(a) 

> (1 - ^oi) nHe (X I Z = 1) - 5i - £„ log(A; - kq)! - o (ne„) (74) 

where the external expectation is on Z", and 7^ and T are as defined in Section [H Now, can 
be upper bounded by either (f63l) or (fMIl following bounds similar to (l4T]) and (l42]) . respectively, 
He{X I Z = 1) is the i.i.d. source entropy given only letters with 9i > l/n^~'^ are drawn, and 
(a) follows from Pe {% \ Z") < 1, He Z", T,) < Si, He (x"!^'", Z", T,) < log(A; - kq)!, 

and then Ee {Pe {% \ Z") | Z"} = Ee {Pe {% | Z")} = Pe {%) < £„. Finally, He (T | Z") < 
/i2[Pe C?;)] = o(ne„). Summing (HI]) and ([711), 

(1 - ifoi) nHe (X I Z = 1) - 5i - o(l) + n/i2 (^oi) = nH^°^\x) - - o(l). (75) 
With the chain rule, and data processing, 

n n 

He (^'^ I Z") = ^He | ^^■-\Z") > ^^^^9 (^j I X^■-^Z") 
i=i i=i 

fcoi 

-Y^Ee {Ee [X,(i) - (^ G I ^1} log I ^ = 0) + ^3 
fcoi 

y ^ii;,[iV,(z)-P,(iGX")]log^+53 (76) 



where $S_2$ is the average cost of re-occurrence of letters $i$ with $\theta_i \le 1/n^{1-\varepsilon}$, and $S_3$ is the average cost
of first occurrences of such letters. Step (a) follows from rearranging the sum into re-occurrences
and first occurrences, where each is expressed over all (small probability) alphabet symbols, (b)
follows from E {E [U \ V]} = E [U], for random variables U and V, and since — log Pg {i \ Z = 0) = 
log ((/?oi/^j)- This yields ([65]) . Then, ([66]) and ([67]) follow from ([T6]) and ([TH]) . respectively, where 
the preceding in the first sum of (I67p follows from the lower bound in (jlSp . Now, 



(a) 

S3 > — E'e < -Eg 



Koi-l 



> iloge)Ee{Eg 



Koi-l i 

E E 

i=0 j=l 



Loi-1 j 



> (loge)- E E— = E (^01-0^, 



Loi-l 



(77) 



where (a) follows because each new occurrence of an index is allocated the remaining total probability,
where in the worst case, letters occur in ascending order of probabilities, (b) follows from
$-\log(1-x) \ge x\log e$, (c) follows from Jensen's inequality, where the function is convex in $K_{01}$
because each increase in $K_{01}$ results in no smaller increase of the expression than the previous
increase in $K_{01}$. Then, $E_\theta K_{01} = L_{01}$ is used. Finally, (d) follows from rearrangement of the double sum.
The proof is concluded by combining (75), (76), and (72) to obtain all components of (70), where
the bounds on all terms are provided in (63)-(69). □



6 Entropy Range 

Bounds presented so far depend on the arrangement of probability parameters in the probability
space. However, can we say more than (1) about the pattern entropy without knowledge of this
arrangement? The answer is yes for large enough $k$ and sufficiently large letter probabilities. There
are $O(n^{(1+\varepsilon)/2})$ bins in $\tau$. Due to the constraint $\sum_i \theta_i = 1$, very few of the larger parameter bins
are populated, essentially leading to $O(n^{1/3+\varepsilon})$ populated bins. If $k$ is greater, this forces more
than a single letter probability to populate a single bin, thus decreasing the pattern entropy. The
range of values the entropy can take is bounded below.

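Theorem 5 Fix $\delta > 0$. Let $n \to \infty$, and $\varepsilon, \varepsilon_1 \ge (1+\delta)(\ln\ln k)/(\ln n)$. Let $\theta_i > 1/n^{1-\varepsilon_1}$,
$\forall i$, $1 \le i \le k$, and let $k > n^{1/3+\varepsilon}$. Then,

$$nH_\theta(X) - \log(k!) \le H_\theta(\Psi^n) \le nH_\theta(X) - \frac{3}{2}\,k\log\frac{k}{e\,n^{1/3+\varepsilon/2}}. \qquad (78)$$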

The bounds of Theorem 5 give a range within which the pattern entropy must be. For alphabets
with $k > n^{1/3+\varepsilon}$, the entropy must decrease essentially by at least $1.5\log(k/n^{1/3})$ bits per alphabet
symbol. All low order terms can be absorbed in the $\varepsilon/2$ exponent of the denominator. Alternatively, a
term of $O(k\log\log n)$ can be included, and the exponent is reduced to $\varepsilon/3$. Asymptotically, $\varepsilon_1$
and $\varepsilon$ can be equal. However, for practical $n$, different values may be required to guarantee
occurrence of all letters, and to ensure that low order terms do not overwhelm the decrease in entropy.
Figure 1 demonstrates the region of decrease in the pattern entropy w.r.t. the i.i.d. one. The upper
region bound shown is the non-asymptotic one given in (86) when proving (78). Smaller order terms
influence the region for practical $n$. The tightness of (78) depends on the particular source. For
uniform sources, the lower bound gives the true behavior. For sources with monotonic parameters,
the upper bound gives a more accurate behavior, as demonstrated in the following example.

Figure 1: Region of decrease from i.i.d. to pattern entropy vs. alphabet size $k$ for n = 10^ bits, $\varepsilon = 0.2$, $n^{\varepsilon_1} = 20$
(left), and for n = 10^° bits, $\varepsilon = 0.1$, $n^{\varepsilon_1} = 1000$ (right). The solid white curve on the left describes
the asymptotic decrease expression in (78). On the right, it overlaps the non-asymptotic upper
bound from (86).
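The range can be illustrated numerically on a very small example. The following sketch is an assumption-laden toy, not the paper's method: it enumerates all sequences of a short block, so it only checks the non-asymptotic lower bound $nH_\theta(X) - \log(k!)$ and the trivial data-processing upper bound $nH_\theta(X)$. The asymptotic upper bound of (78) requires large $n$ and $k$ and is not exercised. The sources and function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def pattern(seq):
    """Replace each symbol by the order of its first occurrence (1-based)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)


def iid_entropy(theta):
    """Per-symbol entropy H(X) in bits of an i.i.d. source."""
    return -sum(p * math.log2(p) for p in theta if p > 0)


def pattern_entropy(theta, n):
    """Block entropy in bits of the pattern of n i.i.d. draws, by enumeration."""
    dist = defaultdict(float)
    for seq in itertools.product(range(len(theta)), repeat=n):
        dist[pattern(seq)] += math.prod(theta[s] for s in seq)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def entropy_decrease(theta, n):
    """Decrease n*H(X) - H(pattern) from the i.i.d. block entropy."""
    return n * iid_entropy(theta) - pattern_entropy(theta, n)


if __name__ == "__main__":
    n = 6
    uniform = [1 / 3] * 3        # hypothetical uniform source
    skewed = [0.7, 0.2, 0.1]     # hypothetical monotonic source
    for name, theta in (("uniform", uniform), ("skewed", skewed)):
        print(name, entropy_decrease(theta, n), "<= log2(k!) =", math.log2(6))
```

For the uniform toy source the decrease comes close to $\log(k!)$, in line with the remark that the lower bound gives the true behavior for uniform sources, while the skewed source loses noticeably less.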

Example 5: Let k = df3 > n^^^'^^/^, where for b = 1,2, ... ,13 there are d letters with probability 
ib- Hence, Y^i^i = Efe=i ^^V"^^' = 1- For n^-^ » k, since d/3 = k, p > - o(l)). 

This leads to 

A, 



^log(K;,!) = log 



< (1 + o(l)) Hog A = (1 + o(l)) ^Hog (79) 



b=l 

Substituting ([79]) in ([38]) (with the third term of ([38]) omitted because the letters in adjacent bins 
are sufficiently spaced), the resulting lower bound asymptotically achieves the upper limit in ([75]) . 

□ 
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Proof of Theorem [5} The upper bound is proved by deriving an upper bound on the second term 
of (j48p in CoroUarylTl which is determined by a lower bound on Mg^^, foUowing a similar bound 
w.r.t. T to that in (jSSp . Since 9i > l/ii}^^^ > , only the first three terms of (j48p exist. The 

first equals nHQ{X) because = 0. Now, for an arbitrary (3 < Br, the set of bins formed by r 
is partitioned into two parts: all bins up to $\beta$ and all others. The maximal possible number of
components of $\theta$ is allocated to the second group, and the remaining components are distributed
in the first group so that $M_{\theta,\tau}$ is minimized. Then, since this holds for every $\beta$, the $\beta$ that maximizes
the lower bound on $M_{\theta,\tau}$ is chosen.

For convenience, denote $A = n^{1+\varepsilon}$. For $\theta_i \in (\tau_\beta, \tau_{\beta+1}]$, $\theta_i > \tau_\beta$. From (5) and since $\sum_i \theta_i = 1$,



it follows that YlbZa = k — Ylb=i < A/ 0^ . Hence, 



A 



b=l 



(80) 



An infimum on $M_{\theta,\tau}$ is obtained by uniformly distributing $k - A/\beta^2$ symbol probabilities in $\beta$ bins,
where the remaining symbol probabilities are uniformly placed in all bins of $\tau$ (this is a lower bound
because it may violate $\sum_i \theta_i = 1$). Following an equation similar to (36) w.r.t. $\tau$,



Mg^r > 



k-A/p^ 



for every (3. Applying Stirling's approximation ([^9]) . 



, k-A/(3^ f3^ 2Tr (k-A/p^) 
lnM,.>(.-^jln^^ + fln^-^ ' 



By differentiation, (j82p is shown to be maximized hy (3 = y^^A/k, where 7 > 2 satisfies 



7 



In 



(7-1)^ 



+ ln- 



ik^ 



Y A 

For k > n^/^"*"^, this implies that 7 must increase at O(lnn). Thus, to first order. 



'opt 



aA , k^ 



(81) 



(82) 



(83) 



(84) 



where a is a constant, asymptotically optimized slightly below a = 1. (The exact value of a only 
affects second order terms.) Plugging (j84[) with a = 1 in (182p . 



k 



1 



2'^""enV3+e/3 2 

as long as A; > n^/'^+^. Plugging (|85p in the second term of 
the probability of no occurrence of any letter, 

k 



In In 



k'^ 



n 



l+e' 



(85) 



He{^>^)<nHg{X) 



T - ke 



2''^°^enV3+e/3 



using the upper bound of on 

9fc log e 



1 



ln-¥ 



fe3 



1 1 



+ 



(86) 
With the valid choices of $\varepsilon$ and $\varepsilon_1$, all lower order terms can be absorbed in a term of $0.25\varepsilon k\log n$
for some sufficiently large $n$, and the upper bound of (78) follows. □



7 Summary and Conclusions 

The entropy of patterns of i.i.d. sequences was studied. Tight upper and lower bounds as functions
of the i.i.d. source entropy, the alphabet size, the letter probabilities, and their arrangement in
the probability space were derived first for distributions with bounded probabilities, and then for
the general case. The bounds demonstrated the range of values the pattern entropy can take,
and showed that in many cases it must decrease substantially from the original i.i.d. sequence
entropy. It was shown that low probability symbols contribute mostly as a single point mass to the
pattern entropy. However, an additional correction term is necessary. Very low probability symbols
contribute negligibly over the contribution of a single point mass. The bounds obtained can be used
to provide very accurate approximations of the pattern block entropies for various distributions as
shown in a followup paper [13].



Appendix A — Proof of Lemma 4.1

The set = IJi -^j ) where 



Using large deviations analysis of typical sets [2], [3], 

Pe i^i) < n ■ 2""™"^^; D(^§i\\ei) ^ ^ _ 2-"rn™[-^(^!+'^ill^O:O(ei-d,||0i)] (A. 2) 

where di = \'%/{2y/n^ ^) and D (^9i\\0i^ is the divergence (relative entropy) between the two 
Bernoulli distributions given by 9i and Oi, respectively. The coefficient n is a bound on the number 
of types. Using Taylor series expansions, for n^^^^^^ < Oi < 0.5, 

D{e,±di\\ei) = {ei±d,)iog(i±^] +{i-eiTdi)iog(iT 



1 



2ei V mj 2{i-9i) V 3(1 -Oi) 

W Mog^ Jog^ 
where $\pm$ and $\mp$ are used respectively to compactly describe both cases. Step (a) is obtained by
combining the first three terms of the expansions for each of the two logarithmic expressions.
The first terms from both expansions cancel each other. Under the assumptions bounding $\theta_i$, the
remaining terms are all nonnegative, yielding a lower bound. Plugging in the value of $d_i$, bounding
$\theta_i$, the second term is now nonnegative and negligible. For the worst case, $1 - d_i/(3\theta_i) \ge 5/6$, leading
to (b). Using the relation between divergence and $L_1$ distance (see, e.g., [2]), for $\theta_i > 0.5$,

D {6, ± dm > ^ m ± d,) -e,\\l = 2(loge)d2 > ^ > (A.4) 

where $\|(\theta_i \pm d_i) - \theta_i\|_1$ is the $L_1$ distance between the Bernoulli distributions defined by $\theta_i \pm d_i$ and
$\theta_i$, respectively. Applying the union bound on the bounds in (A.3) and (A.4) plugged in (A.2),

$$P_\theta\left(\bar{\mathcal{T}}_x\right) \le k \cdot n \cdot 2^{-0.1(\log e)n^{\varepsilon}} \le \exp\left\{-0.1n^{\varepsilon} + (2-\varepsilon)\ln n\right\}, \qquad (A.5)$$

where the second inequality follows from the limit on $\theta_i$ implying $k \le n^{1-\varepsilon}$. The bound is meaningful
for $\varepsilon > (\ln\ln n + \ln 20)/(\ln n)$ and diminishes for $\varepsilon \ge (1+\delta)(\ln\ln n)/(\ln n)$. □
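The condition under which (A.5) is meaningful can be checked numerically. The sketch below is only an illustration under the reading of (A.5) given above: it evaluates the exponent $-0.1n^{\varepsilon} + (2-\varepsilon)\ln n$ and compares it with the stated threshold on $\varepsilon$. The value of `n` and the function names are arbitrary assumptions.

```python
import math


def typicality_exponent(n, eps):
    """Exponent of the right-hand side of (A.5): -0.1 n^eps + (2 - eps) ln n."""
    return -0.1 * n ** eps + (2 - eps) * math.log(n)


def meaningful_threshold(n):
    """Threshold (ln ln n + ln 20) / ln n above which the bound is below 1."""
    return (math.log(math.log(n)) + math.log(20)) / math.log(n)


if __name__ == "__main__":
    n = 10 ** 6  # hypothetical block length
    eps_t = meaningful_threshold(n)
    print("threshold on eps:", eps_t)
    # At the threshold n^eps = 20 ln n, so the exponent equals -eps * ln n < 0.
    print("exponent at threshold:", typicality_exponent(n, eps_t))
    # Below the threshold the exponent can be positive and the bound is vacuous.
    print("exponent at eps = 0.1:", typicality_exponent(n, 0.1))
```

At the threshold, $n^{\varepsilon} = 20\ln n$, so the exponent reduces to $-\varepsilon\ln n$, which explains the constant $\ln 20$.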



Appendix B — Proof of Lemma 4.2



For a source 6, a permutation vector a, and a sequence x", define 

k = e,-e{ai). 

Then, by the conditions of the lemma, the definition of S in ()33p . and by 



By the triangle inequality, 

h = 

(a) 
< 



6, - e (a,) 



\6^\ < 



< 



(7i 



n 



l+2e 



+ 



, 3^^!^ _ ^e{ai) + 5, , 3^^ 



2^ 



1-e 



1^ 



+ 



+ 



l+2e 



2^ 



1-e 



+ 



l + 2£ 



1^ 



n 



l+2e 



5 ^JJWi ^^fmo^)^ W I^W^ 



1-e 



1-e ' 



(B.l) 
(B.2) 

(B.3) 



(B.4) 



In 2n3/4 

(a) is obtained from ^ and (lR3l) . (6) is since \/o+T < for a, 6 > 0, (c) is from combining 

the first and last term and from (|B.3p . and (d) results from Qi > 1/n^^^ l/n^^^ < Q (cxi)^^'^. 
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Applying for x'^ € % and a e S, 

k 



In 



Pe {xn 



/ N (- k 



5i + 



n 



A 5 A W 6A: 



(B.5) 



where (a) follows from $\ln(1+x) \le x$, (b) follows from $\theta_i = \theta(\sigma_i) + \delta_i$, (c) is because all displacements
sum to 0, and (d) follows by applying (B.3)-(B.4). □



Appendix C — Proof of Lemma 4.3

Let $\sigma = \{\sigma_j\}_{j=1}^{k}$ be a permutation vector. Then, $x^n$ is permuted by $\sigma$ to $w^n = \sigma(x^n) =$
$(\sigma_{x_1}, \sigma_{x_2}, \ldots, \sigma_{x_n})$. For example, let $x^n = 333112222222$ and $\sigma = (3,1,2)$, then, $w^n = 222331111111$,
i.e., if $\sigma_j = i$, letter $j$ in $x^n$ is replaced by $i$ in $w^n$. In the example, $\sigma_2 = 1$, and $j = 2$ is replaced
by $i = 1$. We show that if $x^n, w^n \in \mathcal{T}_x$, then letter $i$ can replace only letters $j$ whose probability
parameters are in the same bin as $\theta_i$ or in the two surrounding bins of $\xi$. Then, the total number
of such permutation vectors is upper bounded.
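The action of a permutation vector on a sequence can be sketched in a few lines. This is only an illustration of the definition above, using the example sequence from the text; the helper names are hypothetical. It also shows that relabeling letters leaves the pattern unchanged, which is why all such permutations of a sequence map to one pattern.

```python
def permute(x, sigma):
    """Replace every letter j of x by sigma_j (sigma is 1-indexed as in the text)."""
    return "".join(str(sigma[int(letter) - 1]) for letter in x)


def pattern(seq):
    """Replace each symbol by the order of its first occurrence (1-based)."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)


if __name__ == "__main__":
    x = "333112222222"
    sigma = (3, 1, 2)  # sigma_1 = 3, sigma_2 = 1, sigma_3 = 2
    w = permute(x, sigma)
    print(w)
    print(pattern(x) == pattern(w))
```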

Lemma C.l Let ip"' G 7^, and x" E 7^ such that ijj^ = ^'(x"). Let tw" = o" (x") such that 
£ Tx. For i;l < i < k, let 6i G {ib-,£,b+i\ o-^d let j be such that aj = i. Then, 6j £ {Cb~ii(,b+2]- 

Proof : The proof is by contradiction, 6j (^6-1,^6+2] contradicts x",w^ £ T^. By x'^,w^ £ 7^, 



< 



< 



2^ 



(C.l) 
(C.2) 



where 6j (x") and 6i {w"^) are the ML estimates of 9j and 9i from x" and w'^, respectively. By 
definition of j, di (if") = 9j (x"). By the triangle inequality, (jC.ip . and ()C.2p . 



< 



< 



Oi - e, (x") 



+ 



(«;") + % (x") - 6, 



+ 



2^/E 



l-e 



(C.3) 
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If 6j ^ (^6-1,^6+2]) it must satisfy Oj G (^^, ^/j+i] for some /3>6 + 2or/3<6 — 2. In the first case, 



(o) 2 (/? + 1 - 1.5) (fe) 271^ 



l-e 



n 



n 



l-e 



> 



/3+1 



-l-e 



> 



l-e 



> 



+ 



-l-e ' 



(C.4) 



(a) follows from with 6 = /3 - 1, (6) from §1 with 6 = /? + 1, (c) is because /3 + l>6 + 3>4 
and thus y^^/j+i > 4/-^/n"'^ ^, (d) is by the assumed range of 6j, and (e) is again by the assumption 
that 9j > 9i. For 9i > 9j, where /? < 6 — 2, in a similar manner, 

2(6 + 1 - 1.5) , ^/9^ + ^ 



'j- > 6 - 6- 



n 



l-e 



> 



l-e 



(C.5) 



The last inequality is obtained as ()C.4p by exchanging the roles of b and /?. Equations ()C.4p and 
(fas!) contradict (103]) . Hence, 9j e (^6_i,^b+2]. □ 



Equation ()42p follows directly from Lemma ICH since every letter i can only permute into 
{Cb-ii(,b+2]- To prove (jlTI) . a permutation replacing letter j with 0j G (^fe-i, '^{,+2] by letter i with 
£ (Cb) '^6+1] can be done in the following steps: For each two adjacent bins select how many and 
which letters are exchanged between the two bins and exchange the occurrences of these letters. 
Then, permute only within the letters in a bin for all bins. Then, 



(a) 



6=1 I Mb=0 I 13=1 



(b) ( '^b-'"b-i Kfc+l 

SHE ("' ■E(''S') -n'^^ 

6=1 I Ui=0 vi,=0 I 13=1 



A^ I Kt+i I A^ 

n^'"-"'--E(''s') -n-^ 

b=l I 1)6=0 I 13=1 



= 2- . n E {A) ■ 2"'-'- ■ E he) ■ n 

6=2 (^1)6-1=0 J ?;aj=0 ^ 13=1 

^^ 1 ^« (/) 

[b=2 J /3=i p=i 

Inequality (a) follows from the definition of the process above. Some permutations within adjacent
bins lead to untypical sequences, yielding an inequality. There are up to $\min\{\kappa_1, \kappa_2\}$ choices of
exchanging letters between bins 1 and 2. (By definition $u_0 = v_0 = 0$.) For bin $b$, there are up to
at most the number of letters in the bin not exchanged with bin $b-1$ to exchange with bin $b+1$.
The last product represents permutations within bins after exchanges. Inequality (b) follows from
$\sum_i a_i b_i \le \sum_i a_i \cdot \sum_i b_i$ for $a_i, b_i \ge 0$, and from increasing the limit of one of the sums, (c) is a binomial
series equality, (d) results from reorganization of terms such that the new general term originates
from the second term at index $b-1$ and the first term with index $b$. Binomial series relations (since
$\sum_i \binom{\kappa_b}{i} 2^i = (2+1)^{\kappa_b}$) lead to (e), and upper bounding the sum of all $\kappa_b$ by $k$ leads to (f). □



Appendix D — Proof of Lemma 5.1



First, define 5i = 9i — pb (Oi). By definition of the pb (Oi), Oi and ph (Oi) must be in the same bin. 
Hence, by &,\5i\< 3^pb {Oi)/^^^^'. Then, 

k k k , r, \ 



^ Qi log Pb iOi) = -n ^ Oi log Oi-n ^ Oi log 

i=fcoi+l i=A;oi+l j=fcoi+l 

1 k / r 



iHf'^\x)+nY,^b\ogipb + n ^ (p, (^,) + J,) bg 1 + 



(, n ■ I, , 1 K Pb{6i 

6=0 «=fcoi+l 



(") Ik ^ 

< nHl°'^\x) +n'^ipblogipb + n ^ ipbi0i)+^i) — T^T^^^^ 

6=0 i=fcoi+l ^ 

< nFi°'^)(X)+n^<^,log(^, + ^^. Yl (D.l) 

6=0 6>2,fc6>l 

where (a) follows from $\ln(1+x) \le x$, and (b) is because the total divergence from the average in any
bin is 0, the bound in (9), and since $\delta_i \ne 0$ only when $k_b > 1$. Equation (57) is proved. Following
the upper bound in ()13p and the union bound for each bin of t], and since kb < 1 for b > A^j, ()58p 
is obtained. 

Recall that $\ell_b = \min\{k_b, n\}$. Then, assuming that the maximum of $\ell_b$ symbols in bin $b$ occurred
prior to any new occurrence, thus reducing the allocation to any new symbol by $\ell_b\rho_b$,

$$R_b \le -(n\varphi_b - L_b)\log\rho_b - L_b\log(\varphi_b - \ell_b\rho_b); \quad b = 0, 1. \qquad (D.2)$$

While the bound is loose, it serves its purpose well because low probability letters are unlikely to
reoccur. The minimum for (D.2) is attained with $\rho_b$ in (59). It is a valid choice of $\rho_b$ because it
leaves positive first occurrence probability after $\ell_b - 1$ first occurrences. Plugging (59) in (D.2)
yields (60). The following lemma is now required.
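The step from (D.2) to (60) can be verified numerically. The sketch below assumes the forms $\rho_b = (n\varphi_b - L_b)/(n\ell_b)$ for (59) and $-n\varphi_b\log\varphi_b + (n\varphi_b - L_b)\log\ell_b + n\varphi_b h_2(L_b/(n\varphi_b))$ for (60), as read from the surrounding derivation. The parameter values are arbitrary illustrations, not taken from the paper, and the function names are hypothetical.

```python
import math


def h2(p):
    """Binary entropy function in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bound_d2(rho, n, phi, big_l, ell):
    """Right-hand side of (D.2) for a given per-symbol allocation rho."""
    return -(n * phi - big_l) * math.log2(rho) - big_l * math.log2(phi - ell * rho)


def rho_opt(n, phi, big_l, ell):
    """Minimizer of (D.2), i.e. the choice in (59) as read above."""
    return (n * phi - big_l) / (n * ell)


def bound_60(n, phi, big_l, ell):
    """Closed form (60) obtained by plugging (59) into (D.2)."""
    return (-n * phi * math.log2(phi)
            + (n * phi - big_l) * math.log2(ell)
            + n * phi * h2(big_l / (n * phi)))


if __name__ == "__main__":
    # Hypothetical values: n*phi = 50 expected occurrences, L = 30 distinct letters.
    n, phi, big_l, ell = 1000, 0.05, 30.0, 200
    rho = rho_opt(n, phi, big_l, ell)
    print(bound_d2(rho, n, phi, big_l, ell), bound_60(n, phi, big_l, ell))
```

With this $\rho_b$ the remaining first occurrence probability is $\varphi_b - \ell_b\rho_b = L_b/n$, which is positive, consistent with the validity remark above.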

Lemma D.1 The bound in (60) is decreasing in $L_b$ for $b = 0$ and also for $b = 1$ if $k_1 \ge (1+\varepsilon)n^{\varepsilon}$.

As a result of Lemma D.1, an upper bound on $R_0$ can be derived from (60) by lower bounding $L_0$
using (21). Substituting (21), using Taylor series expansion of $\log(1-x)$ leads to (61).

Proof of Lemma D.1: The derivative of the expression in (60) w.r.t. $L_b$ is $\log[(n\varphi_b - L_b)/(L_b\ell_b)]$.
It is thus negative and the function is decreasing if $n\varphi_b - L_b < L_b\ell_b$. If $k_b \ge n$ (for either $b = 0$ or
$b = 1$), this means that $n\varphi_b - L_b < L_b n$, which is satisfied if $L_b > \varphi_b$. Hence, we need to show that
$L_b - \varphi_b > 0$. Using the lower bound on $L_b$ from (20) and the definition of $\varphi_b$,

$$L_b - \varphi_b \ge k_b - \sum_{\theta_i \in (\eta_b, \eta_{b+1}]}\left(e^{-n\theta_i} + \theta_i\right) = \sum_{\theta_i \in (\eta_b, \eta_{b+1}]}\left(1 - e^{-n\theta_i} - \theta_i\right). \qquad (D.3)$$

The function $1 - e^{-nx} - x$ is 0 for $x = 0$. It increases until $x = (\ln n)/n$, and then starts decreasing.
However, at the end of the bin 1 region, $x = 1/n^{1-\varepsilon}$, it still attains a positive value which goes to
1. Hence, since all elements of the sum in (D.3) are positive, it must be greater than 0.

\i n > kb for 6 = 0, i.e., Oi < we need to prove that (fco + 1) -^^o — nipQ > 0. Using the 

lower bound in (pT]) on Lq, 



(ko + l)Lo- n^o > kon^o - ih + 1) (2) Yl ^ " ^) ^ony^o > 0, (D.4) 

i=l 

where the middle inequality is since Yl^f — o(n(/?o)- This can be shown as follows: Let 
6i = ai/n^~^^ for a probability in bin 0, where < 1. Then, Yli=i '^i — 'fQ'n^^^- Now, 



ko ^ ko ^ ko 'n}~^ 

2)12^^ ^ ^ 2;^^"' = ^^^^ = o(n99o). (D.5) 



i=l i=l 1=1 

The second inequality is since at < 1. 

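The last region is that in which $(1+\varepsilon)n^{\varepsilon} \le k_1 < n$. Since we consider bin 1, $\theta_i \le 1/n^{1-\varepsilon}$.
Following the same steps as (D.3) and using the bound in (20),

$$k_1L_1 - n\varphi_1 \ge k_1 \cdot \sum_{\theta_i \in (\eta_1, \eta_2]}\left(1 - e^{-n\theta_i} - \frac{n\theta_i}{k_1}\right). \qquad (D.6)$$

The function $1 - e^{-nx} - nx/k_1$ is 0 for $x = 0$. It increases until $x = (\ln k_1)/n$, and then starts
decreasing. However, at $x = 1/n^{1-\varepsilon}$ it still approaches at least $\varepsilon/(1+\varepsilon) > 0$ if $k_1 \ge (1+\varepsilon)n^{\varepsilon}$.

Thus, $n\varphi_1 - L_1 < k_1L_1 = \ell_1L_1$, and the expression in (60) is decreasing in $L_1$. □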
Appendix E — Proof of Corollary 2



The contributions of all 6i such that 9i < l/n^""^, l/n^"*"^ < Oi < l/n^""^ (third and fourth terms of 
(I45p ). and 6i < l/n^"*"^ (last term of (I45p ) considered in the two parts of Corollary [2] are bounded 
by the last two terms of (j60p for b = 01, b = 1, and 6 = 0, respectively (recall that (j60p also holds 
for b = 01). Applying Lemma iD.ll to bin b, this expression is decreasing in Lb, where for b = 1 
and 6 = 01 this is provided that ki, > {1 + £)rf . Thus a lower bound on yields an upper bound 
on these two terms. For A;;, < (1 + e)n^ in either case, the last two terms of ()46p are O (n^^ logn) 
because (pi, < kb/n^~^ < (1 +e)/n^~^'^. 

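Now, $L_b$ is lower bounded similarly for $b = 0, 1, 01$. Let $\theta_M$ denote the maximal probability in
bin $b$, and denote the probability of letter $i$ in bin $b$ by $\theta_i = \alpha_i\theta_M$, $\alpha_i \le 1$. Using (20),

$$L_b \ge k_b - \sum_i e^{-n\theta_i} = \sum_i\left(1 - e^{-n\theta_M\alpha_i}\right) \overset{(a)}{\ge} \sum_i \alpha_i\left(1 - e^{-n\theta_M}\right) \overset{(b)}{=} \frac{\varphi_b}{\theta_M}\left(1 - e^{-n\theta_M}\right), \qquad (E.1)$$

where (a) follows from $1 - x^{\alpha} \ge \alpha(1-x)$ for $0 \le \alpha \le 1$ and $0 \le x \le 1$, because $1 - x^{\alpha} - \alpha + \alpha x$
equals $1 - \alpha$ for $x = 0$, 0 for $x = 1$, and is decreasing between $x = 0$ and $x = 1$. Equality (b) follows
from $\varphi_b = \sum_i \theta_i = \sum_i \alpha_i\theta_M \Rightarrow \sum_i \alpha_i = \varphi_b/\theta_M$. Using Taylor series expansion,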
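Step (a) of (E.1) rests on the elementary inequality $1 - x^{\alpha} \ge \alpha(1-x)$. The sketch below checks it on a grid and then evaluates both sides of (E.1) for a made-up bin. The probabilities in `thetas` and the block length are hypothetical values chosen only for illustration.

```python
import math


def concavity_gap(x, alpha):
    """Value of 1 - x^alpha - alpha * (1 - x), nonnegative on [0, 1] x [0, 1]."""
    return 1 - x ** alpha - alpha * (1 - x)


def distinct_letters_lower_bound(thetas, n):
    """Sum of 1 - exp(-n * theta_i), the middle expression of (E.1)."""
    return sum(1 - math.exp(-n * t) for t in thetas)


def point_mass_form(thetas, n):
    """Final expression of (E.1): (phi_b / theta_M) * (1 - exp(-n * theta_M))."""
    theta_max = max(thetas)
    return sum(thetas) / theta_max * (1 - math.exp(-n * theta_max))


if __name__ == "__main__":
    grid = [i / 20 for i in range(21)]
    print(min(concavity_gap(x, a) for x in grid for a in grid))
    thetas = [0.001, 0.0005, 0.0002]  # hypothetical letter probabilities in one bin
    print(distinct_letters_lower_bound(thetas, 1000), point_mass_form(thetas, 1000))
```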

{n(pb- Lb)\oglb + rupb -112 i^^] < nyjfc log (4) + -Z^b log (E.2) 

\nipbj ib^b 

Substituting (lE.ip to lower bound Lb for 6 = 1 and 6 = 01 {6m = l/?^^""^). 

Lb 



{mpb - Lb) log ib + n(pb ■ h2 



ntpb 



< nyjft log 4 + 936ni Hog— + e(99fen^ =0(n(^fc log n), (E.3) 

concluding the proof for Part I. 

For 6 = 0, the second term of ([6T]) is increasing in ^ Of. Thus, using (jP.Sp . 

\ 1=1 / '^i^i=i^i 

For the other statement of Part II, first, if V^j < l/v}^^ , also 9i < l/n^^~^^, then, 
9 fco 



n 



■ E ^'1 ^ T^'"""' (^^"""'O = O (-'-'^"^ log n) (E.5) 
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by the same arguments of (jE.4p . Otherwise, 

y E 0H^n'^^-^, (E.6) 

where ^pq^ = 'Yle,<i/nt^+^ ^i' ^^'^ Yl\=i > 1/^^^+^"^, because 30j > l/n^+^ in bin 0. Therefore, 
(y E ^niog^g^<^n2-''-log(2en^'^+^^)=0(n^-'^-logn), (E.7) 
where io < n and (po < 1 are used. □ 



Appendix F — Proof of Lemma 5.2

Four regions of 9i are considered: 6i < ^j/n^~'^, j = 1,2, and 6i > ^j/n^~^, j = 3,4, where 
{jj-j} = (■!?", 1, 1, "i?^), respectively. Let {vj} = (7^, 7^*^, 7~, 7^), respectively, where i?^ < 7^ < 1 
and 1 < 7+ < 1?+. Now, let 

: 3ei;ei, > for 9i < = 1,2, or §i < for Qi > = 3,4) (F.l) 

be the event that for 9i in one of the four regions defined above there exists an empirical ML 
estimate on the other side of the probability interval, that is separated from the boundary of the 
region of 6i by at least a complete interval between points in (??~, 7", 1, 7^^, -i?"^) /n^^"^. By typicality 
arguments and the union bound 

Pe (^) < n • A:,^>„-3 • 2-«--^^(^^ll^0 + ^— (F.2) 

where the additional term bounds the probability of re-occurrence 7~'n^ or more times of any letter 
with 9i < using (jlSp . the bound in ()E.6P with fi + e = 3, and Markov's inequality. Then, 

the union bound on the number of remaining letters (where kg.^^-'i denotes the total letters with 
6i > 1/n'^) and the number of types (at most n) produces the first term. If occurs in region j, 



D (eM) > ^log^ + 1 - ^ iog^3j — > 



Ujlog^ + (/ij - z^i)loge 



(F.3) 



where the second inequality follows Taylor expansion. The values of 7" and 7+ can be optimized 
to maximize the divergence in (jF.SP by trading off between j = 1 and j = 3 for 7^ and between 
j = 2 and j = 4 for 7+. This yields 
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where ± is used to denote both cases. Plugging these choices of 7^, ii T occurs 

(F.5) 



Diem] > 



f^±_i _ 1 

where the minimum is taken between the value of the expression for -d" and for Hence, 

P0{:F) < e'^^n- fe,^>,-3 • e-fi'-^'^y + ^^^l""^'^^^,^, , where (F.6) 
/(^-,^+) ^ min — -In— (F.7) 



^ lni?± e-ln??± 

Specifically, choices of = e'^-^ ^ 0.004 and i9+ = e^-^ 4.06 result in 7" « 0.18, 7+ 2.18, 
/ (??", 'i?'*') > 0.5, and an upper bound of 2.77 /n^^'^ on the last term of ()F.6p . 
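The quoted constants can be reproduced with a short computation. The sketch below rests on assumptions that should be read as a reconstruction rather than as the paper's exact expressions: the optimizing interior points are taken as $\gamma^{\pm} = (\vartheta^{\pm} - 1)/\ln\vartheta^{\pm}$, obtained by equating the two divergence exponents $\gamma\ln\gamma + 1 - \gamma$ and $\gamma\ln(\gamma/\vartheta) + \vartheta - \gamma$, and the exponent of $\vartheta^-$ is assumed to be $-5.5$ because that gives the quoted value of about 0.004. Function names are hypothetical.

```python
import math


def gamma_opt(vartheta):
    """Interior point equalizing the two divergence exponents (assumed form)."""
    return (vartheta - 1) / math.log(vartheta)


def exponent_from_one(gamma):
    """Divergence exponent (in nats) between scaled rates gamma and 1."""
    return gamma * math.log(gamma) + 1 - gamma


def exponent_from_vartheta(gamma, vartheta):
    """Divergence exponent (in nats) between scaled rates gamma and vartheta."""
    return gamma * math.log(gamma / vartheta) + vartheta - gamma


if __name__ == "__main__":
    theta_minus = math.exp(-5.5)  # assumed exponent, approximately 0.004
    theta_plus = math.exp(1.4)    # approximately 4.06
    for vartheta in (theta_minus, theta_plus):
        g = gamma_opt(vartheta)
        print(vartheta, g, exponent_from_one(g), exponent_from_vartheta(g, vartheta))
```

Under these assumptions the two exponents coincide at $\gamma^{\pm}$ and their minimum over both signs stays above 0.5, matching the values stated in the text.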

Let F denote the Bernoulli event of whether event occurs. Then, 

Hg (Z" I ^'") < Hg (Z", F I ^'") = Hg (Z" | F) +Hg{F\ ^'") 

< Pg {:F) He (Z" I + Pe {J') Hg (Z" | J^) + Hg (F) 

(a) (K + K\ 

- J+e> + o(e», (F.8) 

where (a) follows since given the only uncertainty about is for indices for which 6i S 
(7~/n"^~^, 7^/n^~''] , because in all other regions it is guaranteed that 6i is on the correct side of 
l/n^~^, thus there is no uncertainty about the value of corresponding to such -0^. The only 
symbols for which it is possible to have 9i G (7^/n^^^, 7+/n-'^^^] are the + k'^ letters with 
6i G (^"d" /ii}'^ /^n}^^^^. The uncertainty in is choosing which such symbols correspond to 
z = 1, and the worst case is when the total possible choices of k^ out of + k^ are uniformly 
distributed. The second term is since Hg (Z" | ^'",.F) < n for the Bernoulli process Z". □ 